Abstract — An extension of the traditional two-armed bandit 
problem is considered, in which the decision maker has access 
to some side information before deciding which arm to pull. 
At each time $t$, before making a selection, the decision maker 
is able to observe a random variable $X_t$ that provides some 
information on the rewards to be obtained. The focus is on 
finding uniformly good rules (that minimize the growth rate of 
the inferior sampling time) and on quantifying how much the 
additional information helps. Various settings are considered and 
for each setting, lower bounds on the achievable inferior sampling 
time are developed and asymptotically optimal adaptive schemes 
achieving these lower bounds are constructed. 

Index Terms — Two-armed bandit, side information, inferior 
sampling time, allocation rule, asymptotic, efficient, adaptive. 



I. Introduction 

SINCE the publication of [1], bandit problems have at- 
tracted much attention in various areas of statistics, con- 
trol, learning, and economics (e.g., see [2], [3], [4], [5], [6], 
[7], [8], [9], [10]). In the classical two-armed bandit problem, 
at each time a player selects one of two arms and receives 
a reward drawn from a distribution associated with the arm 
selected. The essence of the bandit problem is that the reward 
distributions are unknown, and so there is a fundamental trade- 
off between gathering information about the unknown reward 
distributions and choosing the arm we currently think is the 
best. A rich set of problems arises in trying to find an opti- 
mal/reasonable balance between these conflicting objectives 
(also referred to as learning versus control, or exploration 
versus exploitation). 

We let $\{Y_t^1\}$ and $\{Y_t^2\}$ denote the sequences of rewards 
from arms 1 and 2 in a two-armed bandit machine. In 
the traditional parametric setting, the underlying 
configurations/distributions of the arms are expressed by a pair of 
parameters $C_0 = (\theta_1, \theta_2)$ such that $\{Y_\tau^1\}$ and 
$\{Y_\tau^2\}$ are independent and identically distributed (i.i.d.) 
with distribution $(F_{\theta_1}, F_{\theta_2})$, where $\{F_\theta\}$ 
is a known family of distributions parametrized by $\theta$. 
The goal is to maximize the sum of the 
expected rewards. Results on achievable performance have 
been obtained for a number of variations and extensions of 
the basic problem defined in [9] (e.g., see [11], [12], [13], 
[14], [15], [16], [17]). 

In this paper, we consider an extension of the classical two- 
armed bandit where we have access to side information before 
making our decision about which arm to pull. Suppose at 
time $t$, in addition to the history of previous decisions, 
outcomes, and observations, we have access to a side observation 
$X_t$ to help us make our current decision. The extent to which 
this side observation can help depends on the relationship of 
$X_t$ to the reward distributions of $Y_t^1$ and $Y_t^2$. 

Previous work on bandit problems with side observations 
includes [18], [19], [20], [21], [22]. Woodroofe [21] consid- 
ered a one-armed bandit in a Bayesian setting, and constructed 
a simple criterion for asymptotically optimal rules. Sarkar 
[20] extended the side information model of [21] to the 
exponential family. In [19], Kulkarni considered classes of 
reward distributions and their effects on performance using 
results from learning theory. Most of the previous work with 
side observations is on one-armed bandit problems, which can 
be viewed as a special case of the two-armed setting by letting 
arm 2 always return zero. 

In contrast with this previous work, we consider various 
general settings of side information for a two-armed bandit 
problem. Our focus is on providing both lower bounds and 
bound-achieving algorithms for the various settings. The re- 
sults and proofs are very much along the lines of [8] and 
subsequent works as in [11], [12], [13], [14], [15]. 
We now describe the settings considered in this paper. 
1) Direct Information: In this case, $X_t$ provides information 
directly about the underlying configuration $C_0 = (\theta_1, \theta_2)$, 
which allows a type of separation between the learning and 
control. This has a dramatic effect on the achievable inferior 
sampling time. Specifically, estimating $(\theta_1, \theta_2)$ by 
observing $\{X_\tau\}$, and using the estimate 
$(\hat\theta_1, \hat\theta_2)$ to make the decision, results in 
bounded expected inferior sampling time. 
If the distribution of $\{X_\tau\}$ is not a function of $C_0$, we 
are not able to learn $C_0$ through $\{X_\tau\}$. However, different 
values of the side observation $X_t$ will result in different 
conditional distributions of the rewards $Y_t^i$. By exploiting this 
new structure (observing $X_t$ in advance), we can hope to do 
better than the case without any side observation. 

A physical interpretation of the above scenario (constant 
distribution on $\{X_\tau\}$) is that a two-armed bandit with the side 
observations drawn from a finite set $\{x_1, x_2, \ldots, x_n\}$ can be 
viewed as a set of $n$ different two-armed sub-bandit machines 
indexed from $x_1$ to $x_n$. The player does not know the order 
in which the sub-machines will be played; it is determined by 
rolling a die with $n$ faces. However, by observing $X_t$, the 
player knows which machine (out of the $n$ different ones) 
he is facing before selecting which arm to play. The 
connection between these sub-machines is that they share the 
same common configuration pair $(\theta_1, \theta_2)$, so that the 
rewards observed from one machine provide information on the 
common $(\theta_1, \theta_2)$, which can then be applied to all of the 
others (different values of $X_t$). This is the key aspect that makes 
this setup distinct from simply having many independent bandit 
problems with random access opportunity. 

We consider the following three cases of different relationships 
among the most rewarding arm, $C_0$, and $X_t$. 

2) For all possible $C_0$, the best arm is a function of 
$X_t$: That is, $\forall (\theta_1, \theta_2)$, $\exists x_1, x_2$ such 
that at time $t$, arm 1 yields higher expected reward conditioned 
on $X_t = x_1$ while arm 2 is preferred when $X_t = x_2$. Surprisingly, 
we exhibit an algorithm that achieves bounded expected 
inferior sampling time in this case. Woodroofe's result 
[21] can then be viewed as a special case of this scenario. 

3) For all possible $C_0$, the best arm is not a function 
of $X_t$: In this case, for all configurations $(\theta_1, \theta_2)$, one 
of the arms is always preferred regardless of the value 
of $X_t$. Since the conditional reward distributions are 
functions of $X_t$, the intuition is that we can postpone our 
learning until it is most advantageous to us. We show 
that, asymptotically, our performance will be governed 
by the most "informative" bandit (among the different 
values taken on by $X_t$). 

4) Mixed Case: This is a general case that combines the 
previous two, and contains the main contribution of 
this paper. For some possible configurations, one arm 
may always be preferred (for any $X_t$), while for other 
possible configurations, the preferred arm is a function 
of $X_t$. We exhibit an algorithm that achieves the best 
possible performance in either case. That is, if the best arm is a 
function of $X_t$, it achieves bounded expected inferior 
sampling time as in Case 2, while if the underlying 
configuration is such that one arm is always preferred, 
then we get the results of Case 3. 

Our paper is organized as follows. In Section II, we in- 
troduce the general formulation. In Section III, we provide 
background on the asymptotic analysis of traditional bandit 
problems (without side observations). In Sections IV through 
VII, we consider the above four cases respectively. The results 
are included in each section, while details of the proofs are 
provided in the appendix. 

II. General Formulation 

Consider the two-armed bandit problem defined as follows. 
Suppose we have two sequences of (real-valued) random 
variables (r.v.'s), $\{Y_\tau^i\}_{i=1,2}$, and an i.i.d. side observation 
sequence $\{X_\tau\}$, taking values in $\mathcal{X} \subset \mathbb{R}$. 
$\{Y_\tau^i\}$ denotes the reward sequence of arm $i$ while $X_t$ is the 
side information observed at time $t$ before making the decision. The 
formal parametric setting is as follows. For each configuration pair 
$C_0 = (\theta_1, \theta_2)$ and each $i$, the sequence of vectors 
$(X_t, Y_t^i)$ is i.i.d. with joint distribution 
$G_{C_0}(dx) F_{\theta_i}(dy|x)$, where the families 
$\{G_C\}_{C \in \Theta^2}$ and $\{F_\theta(\cdot|x)\}_{\theta \in \Theta}$ 
are known to the player, but the true value of the corresponding 
index $C_0$ must be learned through experiments. For notational 
simplicity, we further assume that $\Theta$ is a set of real numbers. 

Note that the concept of the i.i.d. bandit is now extended 
to the assumption that the vector sequence $\{(X_t, Y_t^i)\}_t$ is 
i.i.d. The unconditioned marginal sequence $\{Y_t^i\}$ remains i.i.d. 
However, rather than the unconditional marginals, the player 
is now facing the conditional distribution of $Y_t^i$, which is 
a function of the observed side information $X_t$ (and is not 
identically distributed given different $X_t$). 

TABLE I 
Glossary 

Notation and description: 

$G_C(dx)$ 
    The marginal distribution of the i.i.d. $\{X_\tau\}$ under 
    configuration $C$. 

$F_{\theta_i}(dy|x)$ 
    The conditional distribution of the reward of arm $i$, $Y_t^i$, 
    under parameter $\theta_i$. 

$\mu_\theta(x)$ 
    The conditional expectation of the reward, 
    $\mu_\theta(x) = E_\theta\{Y|x\} = \int y F_\theta(dy|x)$. 

$1(C_0)$, $2(C_0)$ 
    The first and the second coordinates of the configuration 
    pair $C_0$, i.e. $1(C_0) = \theta_1$, $2(C_0) = \theta_2$. For example: 
    $F_{1(C_0)}(dy|x) = F_{\theta_1}(dy|x)$ and 
    $\mu_{2(C_0)}(x) = \mu_{\theta_2}(x)$. 

$M_C(x)$ 
    The index of the preferred arm, i.e. 
    $\arg\max_{i=1,2} \{\mu_{i(C)}(x)\}$. 

$\phi_t$ 
    The decision rule taking values in $\{1, 2\}$ and depending 
    only on the past outcomes and the current side information $X_t$. 

$T_i(t)$ 
    The total number of samples taken on arm $i$ up to time $t$, 
    $T_i(t) = \sum_{\tau=1}^{t} 1\{\phi_\tau = i\}$. 

$T_{\mathrm{inf}}(t)$ 
    The total number of samples taken on the inferior arm 
    up to time $t$: 
    $T_{\mathrm{inf}}(t) = \sum_{\tau=1}^{t} 1\{\phi_\tau \ne M_{C_0}(X_\tau)\}$. 

$I(P, Q)$ 
    The Kullback-Leibler (K-L) information number between 
    distributions $P$ and $Q$: 
    $I(P, Q) = E_P\{\log(dP/dQ)\}$. 

$I(\theta_1, \theta_2|x)$ 
    The conditional K-L information number: 
    $I(\theta_1, \theta_2|x) = I(F_{\theta_1}(\cdot|x), F_{\theta_2}(\cdot|x))$. 

The goal is to find an adaptive allocation rule $\{\phi_\tau\}$ to 
maximize the growth rate of the expected reward: 

$$E_{C_0}\{W_\phi(t)\} := E_{C_0}\Big\{\sum_{\tau=1}^{t} \big(1\{\phi_\tau = 1\} Y_\tau^1 + 1\{\phi_\tau = 2\} Y_\tau^2\big)\Big\},$$ 

or equivalently to minimize the growth rate of the expected 
inferior sampling time$^1$, namely $E_{C_0}\{T_{\mathrm{inf}}(t)\}$. To be 
more explicit, at any time $t$, $\phi_t$ takes a value in $\{1, 2\}$ and 
depends only on the past rewards ($\tau < t$) and the current side 
observation $X_t$. 

We define a uniformly good rule as follows. 

Definition 1 (Uniformly Good Rules): An allocation rule 
$\{\phi_\tau\}$ is uniformly good if for all $C \in \Theta^2$, 
$E_C\{T_{\mathrm{inf}}(t)\} = o(t^\alpha)$, $\forall \alpha > 0$. 

In what follows, we consider only uniformly good rules and 
regard other rules as uninteresting. Necessary notation and 
several quantities of interest are defined in TABLE I. We 
assume that all the given expectations exist and are finite. 
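
To make the quantities $T_i(t)$ and $T_{\mathrm{inf}}(t)$ concrete, the following minimal Python sketch simulates a two-armed bandit with side observations and counts both for a given allocation rule. The Gaussian reward model $\mathcal{N}(\theta_i x, 1)$, the uniform draw of $X_t$ from a finite set, and all function names are assumptions made purely for illustration; they are not part of the formulation above.

```python
import random

def mu(theta, x):
    """Assumed conditional mean reward: mu_theta(x) = theta * x."""
    return theta * x

def best_arm(config, x):
    """M_C(x): index (1 or 2) of the arm with the larger conditional mean."""
    return 1 if mu(config[0], x) >= mu(config[1], x) else 2

def simulate(rule, config, side_values, horizon, seed=0):
    """Run an allocation rule for `horizon` steps; return (T_1, T_2, T_inf)."""
    rng = random.Random(seed)
    pulls = {1: 0, 2: 0}
    t_inf = 0
    history = []  # past (x, arm, reward) triples available to the rule
    for _ in range(horizon):
        x = rng.choice(side_values)        # i.i.d. side observation X_t
        arm = rule(history, x)             # phi_t uses the past and X_t only
        reward = rng.gauss(mu(config[arm - 1], x), 1.0)
        history.append((x, arm, reward))
        pulls[arm] += 1
        if arm != best_arm(config, x):     # one more inferior sample
            t_inf += 1
    return pulls[1], pulls[2], t_inf

def always_arm_one(history, x):
    """A rule that ignores all information (not uniformly good)."""
    return 1

def oracle(history, x, config=(1.0, 2.0)):
    """A rule that knows C_0; it never samples the inferior arm."""
    return best_arm(config, x)
```

For example, `simulate(always_arm_one, (1.0, 2.0), [-1, 1], 200)` pulls arm 1 at every step, so its inferior sampling time grows linearly with the horizon, whereas the oracle rule has $T_{\mathrm{inf}}(t) = 0$.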

III. Traditional Bandits 

Under the general formulation provided in Section II, the 
traditional non-Bayesian, parametric, infinite horizon, two-armed 
bandit is simply a degenerate case, i.e., the traditional 
bandit problem is equivalent to having only one element in 
$\mathcal{X}$ (say $\mathcal{X} = \{x_0\}$). This formulation of 
traditional bandit problems is identical to the two-armed case of 
[14], [8], [9]. For simplicity, the argument $x_0$ can be omitted in 
this traditional setting, i.e., $M_C := M_C(x_0)$, 
$\mu_\theta := \mu_\theta(x_0)$, 
$I(\theta_1, \theta_2) := I(\theta_1, \theta_2|x_0)$, etc. 

1 In the literature of bandit problems, the term "regret" is more typically 
used rather than the inferior sampling time. For traditional two-armed bandits, 
the regret is defined as 

$$\text{regret} := t \cdot \max\{\mu_{\theta_1}, \mu_{\theta_2}\} - E_{C_0}\{W_\phi(t)\},$$ 

the difference between the best possible reward and that of the strategy of 
interest $\{\phi_\tau\}$. The relationship between the regret and 
$T_{\mathrm{inf}}(t)$ is as follows. 

$$\text{regret} = |\mu_{\theta_1} - \mu_{\theta_2}| \cdot E_{C_0}\{T_{\mathrm{inf}}(t)\}.$$ 

For greater simplicity in the discussion of bandit problems with side 
observations, we consider $T_{\mathrm{inf}}(t)$ rather than the regret. 

The main contribution of [14], [8], [9] is the asymptotic 
analysis stated as the following two theorems. 

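Theorem 1 ($\log t$ Lower Bound): For any uniformly good 
rule $\{\phi_\tau\}$, $T_{\mathrm{inf}}(t)$ satisfies 

$$\lim_{t \to \infty} P_{C_0}\Big(T_{\mathrm{inf}}(t) \ge \frac{(1 - \epsilon) \log t}{K_{C_0}}\Big) = 1, \quad \forall \epsilon > 0,$$ 

and 

$$\liminf_{t \to \infty} \frac{E_{C_0}\{T_{\mathrm{inf}}(t)\}}{\log t} \ge \frac{1}{K_{C_0}},$$ 

where $K_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, then 
$T_{\mathrm{inf}}(t) = T_1(t)$ and $K_{C_0}$ is defined$^2$ as follows. 

$$K_{C_0} = \inf\{I(\theta_1, \theta) : \theta \in \Theta,\ \mu_\theta > \mu_{\theta_2}\}. \tag{1}$$ 

The expression for $K_{C_0}$ for the case in which $M_{C_0} = 1$ can 
be obtained by symmetry. 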

Theorem 2 (Asymptotic Tightness): Under certain regularity 
conditions$^3$, the above lower bound is asymptotically tight. 
Formally stated, given the distribution family $\{F_\theta\}$, there 
exists a decision rule $\{\phi_\tau\}$ such that for all 
$C_0 = (\theta_1, \theta_2) \in \Theta^2$, 

$$\limsup_{t \to \infty} \frac{E_{C_0}\{T_{\mathrm{inf}}(t)\}}{\log t} \le \frac{1}{K_{C_0}},$$ 

where $K_{C_0}$ is the same as in Theorem 1. 

The intuition behind the $\log t$ lower bound is as follows. 
Suppose $M_{C_0} = 2$ and consider another configuration 
$C' = (\theta, \theta_2)$ such that $M_{C'} = 1$. It can be shown that if, 
under configuration $C_0 = (\theta_1, \theta_2)$, 
$E_{C_0}\{T_{\mathrm{inf}}(t)\}$ is less than the $\log t$ lower bound, 
then $E_{C'}\{T_{\mathrm{inf}}(t)\}$ must be greater than $o(t^\alpha)$ 
for some $\alpha > 0$, which contradicts the assumption that 
$\{\phi_\tau\}$ is uniformly good. 
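
As a worked illustration of the constant in (1), the sketch below evaluates $K_{C_0}$ and the resulting $\log t / K_{C_0}$ benchmark for a finite parameter set. It assumes unit-variance Gaussian arms, $F_\theta = \mathcal{N}(\theta, 1)$, for which $\mu_\theta = \theta$ and $I(\theta_1, \theta) = (\theta_1 - \theta)^2 / 2$; this family and the function names are illustrative assumptions, not taken from the paper.

```python
import math

def kl_gauss(theta_a, theta_b):
    """K-L information between N(theta_a, 1) and N(theta_b, 1)."""
    return 0.5 * (theta_a - theta_b) ** 2

def k_constant(theta1, theta2, thetas):
    """K_{C_0} of (1) when arm 2 is the better arm (theta1 < theta2).

    The infimum runs over competing parameters whose mean exceeds that of
    arm 2; by convention the infimum of the empty set is infinity.
    """
    candidates = [kl_gauss(theta1, th) for th in thetas if th > theta2]
    return min(candidates) if candidates else math.inf

def lower_bound(theta1, theta2, thetas, t):
    """Asymptotic benchmark log(t) / K_{C_0} for the inferior sampling time."""
    k = k_constant(theta1, theta2, thetas)
    return 0.0 if math.isinf(k) else math.log(t) / k
```

With `thetas = [0.0, 0.5, 1.0, 1.5]` and $(\theta_1, \theta_2) = (0, 0.5)$, the closest competing parameter is $\theta = 1$, giving $K_{C_0} = 0.5$. When $\theta_2$ is already the largest element of the set, no competitor exists, $K_{C_0} = \infty$, and the benchmark is zero.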



2 Throughout this paper, we will adopt the conventions that the infimum of 
the null set is $\infty$, and $1/\infty = 0$. 

3 If the parameter set $\Theta$ is finite, Theorem 2 always holds. If 
$\Theta$ is the set of reals, the required regularity conditions are on the 
unboundedness and the continuity of $\mu_\theta$ w.r.t. $\theta$ and on the 
continuity of $I(\theta_1, \theta)$ w.r.t. $\mu_\theta$. 

IV. Direct Information 

A. Formulation 

In this setting, the side observation $X_t$ directly reveals 
information about the underlying configuration pair 
$C_0 = (\theta_1, \theta_2)$ in the following way. 

Dependence: $G_{C_1} = G_{C_2}$ iff $C_1 = C_2$. 
As a result, observing the empirical distribution of $X_t$ gives 
us useful information about the underlying parameter pair $C_0$. 
Thus this is a type of identifiability condition. 

Examples: 

• $\Theta = (0, 0.5)$ and $\mathcal{X} = \{x_1, x_2, x_3\}$, with 

$$P_{(\theta_1, \theta_2)}(X_t = x_k) = \begin{cases} \theta_k & \text{if } k = 1, 2 \\ 1 - \theta_1 - \theta_2 & \text{otherwise.} \end{cases}$$ 

• $\Theta = (0, \infty)$ and $\mathcal{X} = [0, 1]$. $X_t$ is beta distributed 
with parameters $(\theta_1, \theta_2)$. 

B. Scheme with Bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$ 

Consider the following condition. 
Condition 1: For any fixed $C_0$, 

$$\inf\{\rho(G_{C_0}, G_{C_e}) : C_e \in \Theta^2,\ \exists x,\ M_{C_e}(x) \ne M_{C_0}(x)\} > 0,$$ 

where $\rho$ denotes the Prohorov metric$^4$ on the space of 
distributions. Two examples satisfying Condition 1 are as follows: 

• Example 1: $\mathcal{X}$ is finite, and $\forall x \in \mathcal{X}$, 
$\mu_\theta(x)$ is continuous with respect to (w.r.t.) $\theta$. 

• Example 2: $F_\theta(\cdot|x) \sim \mathcal{N}(\theta x, 1)$ is a Gaussian 
distribution with mean $\theta x$ and variance 1. 

Under this condition, we obtain the following result. 

Theorem 3 (Bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$): If Condition 1 is 
satisfied, then there exists an allocation rule $\{\phi_\tau\}$ such that 
$\lim_{t \to \infty} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty$ and 
$\lim_{t \to \infty} T_{\mathrm{inf}}(t) < \infty$ a.s. 

• Note: the information directly revealed by $X_t$ helps 
the sequential control scheme surpass the $\log t$ lower 
bound stated in Theorem 1. This significant improvement 
(bounded expected inferior sampling time) is due to the 
fact that the dilemma between learning and control no 
longer exists in the direct information case. 

We provide a scheme achieving bounded 
$E_{C_0}\{T_{\mathrm{inf}}(t)\}$ as in Algorithm 1, of which a detailed 
analysis is given in Appendix II. 

Algorithm 1 $\phi_t$, the decision at time $t$ (after observing $X_t$ 
but before deciding $\phi_t$) 

1: Construct 

$$\mathcal{C}_t := \Big\{ C \in \Theta^2 : \rho(G_C, L_X(t)) \le \inf_{C \in \Theta^2} \rho(G_C, L_X(t)) + \text{[slack term illegible in source]} \Big\},$$ 

where $L_X(t)$ is the empirical measure of the side observations 
$\{X_\tau\}$ until time $t$, and $\rho$ is the Prohorov metric as before. 
2: Arbitrarily pick $\hat{C}_t \in \mathcal{C}_t$, and set 
$\phi_t = M_{\hat{C}_t}(X_t)$. 
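
The two steps of Algorithm 1 (Section IV-B) can be sketched in Python for the three-point example of Section IV-A, where $P(X_t = x_k) = \theta_k$ for $k = 1, 2$. This is a minimal sketch under several assumptions that are ours rather than the paper's: $\Theta$ is replaced by a finite grid, the total variation distance stands in for the Prohorov metric, the slack added to the infimum is a free parameter `slack`, and the preferred-arm map $M_C(x)$ is supplied by the caller as `preferred_arm`.

```python
from collections import Counter
from itertools import product

def side_pmf(config):
    """G_C for the example: P(X = k) = theta_k for k = 1, 2; the rest on k = 3."""
    th1, th2 = config
    return {1: th1, 2: th2, 3: 1.0 - th1 - th2}

def tv_distance(p, q):
    """Total variation distance on {1, 2, 3} (a stand-in for the Prohorov metric)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in (1, 2, 3))

def empirical_pmf(xs):
    """L_X(t): empirical measure of the side observations seen so far."""
    counts = Counter(xs)
    return {k: counts[k] / len(xs) for k in (1, 2, 3)}

def algorithm1_decision(xs, theta_grid, preferred_arm, slack):
    """Return (estimate, arm) after observing xs = [X_1, ..., X_t]."""
    emp = empirical_pmf(xs)
    dists = {c: tv_distance(side_pmf(c), emp)
             for c in product(theta_grid, repeat=2)}
    best = min(dists.values())
    # Step 1: near-minimizers of the distance to the empirical measure.
    candidates = [c for c, d in dists.items() if d <= best + slack]
    # Step 2: pick any candidate and play its preferred arm for the current X_t.
    estimate = candidates[0]
    return estimate, preferred_arm(estimate, xs[-1])
```

Note that the decision uses only the side observations, never the rewards, which is why learning and control separate in this setting.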



V. Best Arm As A Function Of $X_t$ 

For all of the following sections (Sections V through VII), 
we consider only the case in which observing $X_t$ will not 
reveal any information about $C_0$, but only reveals information 
about the upcoming reward $Y_t^i$, that is, 

• $G_{C_0}$ does not depend on the value of $C_0$; we use 
$G := G_{C_0}$ as shorthand notation. 

Three further refinements regarding the relationship between 
$M_C(x)$ and $x$ will be discussed separately (each in one 
section). 

A. Formulation 

In this section, we assume that for all possible $C$, the side 
observation $X_t$ is always able to change the preference order 
as shown in Fig. 1. That is, 

• For all $C \in \Theta^2$, there exist $x_1$ and $x_2$ such that 
$M_C(x_1) = 1$ and $M_C(x_2) = 2$. 

Fig. 1. The best arm at time $t$ always depends on the side observation 
$X_t$. That is, for any possible pair $(\theta_1, \theta_2)$ the two curves, 
$\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$, (w.r.t. $x$) always intersect 
each other. 



The needed regularity conditions are as follows. 

1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all 
$x \in \mathcal{X}$. 

2) $\forall \theta_1, \theta_2, x$, $I(\theta_1, \theta_2|x)$ is strictly 
positive and finite. 

3) $\forall x$, $\mu_\theta(x)$ is continuous w.r.t. $\theta$. 

The first condition embodies the idea of treating $X_t$ as 
the index of several different bandit machines, which also 
simplifies our proof. The second condition is to ensure that 
all these different bandit problems are non-trivial, with 
non-identical pairs of arms. 
Example: 

• $\Theta = (0, \infty)$, $\mathcal{X} = \{-1, 1\}$, and the conditional 
reward distribution $F_\theta(\cdot|x) \sim \mathcal{N}(\theta x, 1)$. 

B. Scheme with Bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$ 

Theorem 4 (Bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$): If the above 
conditions are satisfied, there exists an allocation rule 
$\{\phi_\tau\}$ such that 

$$\lim_{t \to \infty} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty.$$ 

Such a rule is obviously uniformly good. 

• Note: although the side observation $X_t$ does not reveal 
any information about $C_0$ in this setting, the alternation 
of the best arm as the i.i.d. $X_t$ takes on different values $x$ 
makes it possible to always perform the control part, 
$\phi_t = M_{\hat{C}_t}(X_t)$, and simultaneously sample both arms often 
enough. Since the information about both arms will be 
implicitly revealed (through the alternation of $M_{C_0}(X_t)$), 
the dilemma of learning and control no longer exists, and 
a significant improvement 
($\lim_{t \to \infty} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty$) 
is obtained over the $\log t$ lower bound in Theorem 1. 

We construct an allocation rule with bounded 
$E_{C_0}\{T_{\mathrm{inf}}(t)\}$ given as Algorithm 2. The intuition as 
to why the proposed scheme has bounded 
$E_{C_0}\{T_{\mathrm{inf}}(t)\}$ is as follows. The forced 
sampling, $T_i(t) < \sqrt{t + 1}$, ensures there are enough samples 
on both arms, which implies good enough estimates of $C_0$. 
Based on the good enough estimates, the myopic action of 
sampling the seemingly better arm, 
$\phi_{t+1} = M_{\hat{C}_t}(X_{t+1})$, will 
result in very few inferior samplings. Unlike the traditional 
two-armed bandits, in this scenario, the best arm $M_{C_0}(x)$ 
varies from one outcome of $X_t$ to the other. Therefore, the 
myopic action and the even appearances of the i.i.d. $\{X_\tau\}$ 
will eventually make both $T_1(t)$ and $T_2(t)$ grow linearly with 
the elapsed time $t$, and the forced sampling should occur only 
rarely. This situation differs significantly from the traditional 
bandits, where the forced sampling will inevitably make 
$T_{\mathrm{inf}}(t)$ of the order of $\sqrt{t}$, which is an undesired 
result. 

A detailed proof of the boundedness of 
$E_{C_0}\{T_{\mathrm{inf}}(t)\}$ for this scheme is provided in 
APPENDIX III. 

4 A definition of the Prohorov metric is stated in APPENDIX I. 

Fig. 2. The best arm at time $t$ never depends on the side observation 
$X_t$. That is, for any possible pair $(\theta_1, \theta_2)$, the two curves, 
$\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$, do not intersect each other. 
However, in this case, we can postpone our sampling to the most 
informative time instants. 

Algorithm 2 $\phi_{t+1}$, the decision at time $t + 1$ 

Variables: Denote $T_i^x(t)$ as the total number of time instants until 
time $t$ when arm $i$ has been pulled and $X_\tau = x$, i.e. 

$$T_i^x(t) := \sum_{\tau=1}^{t} 1\{X_\tau = x,\ \phi_\tau = i\},$$ 

and define $x_i^* := \arg\max_x \{T_i^x(t)\}$ and 
$T_i^*(t) := \max_x \{T_i^x(t)\}$. 
Construct 

$$\mathcal{C}_t := \Big\{ C = (\theta_1, \theta_2) \in \Theta^2 : \sigma(C, t) \le \inf\{\sigma(C, t) : C \in \Theta^2\} + \text{[slack term illegible in source]} \Big\},$$ 

with 

$$\sigma(C, t) := \rho\big(F_{1(C)}(\cdot|x_1^*), L_1^{x_1^*}(t)\big) + \rho\big(F_{2(C)}(\cdot|x_2^*), L_2^{x_2^*}(t)\big),$$ 

where $L_i^x(t)$ is the empirical measure of rewards sampled from arm $i$ 
at those time instants $\tau \le t$ when $X_\tau = x$. (As before, 
$\rho(P, Q)$ is the Prohorov metric.) Arbitrarily choose 
$\hat{C}_t \in \mathcal{C}_t$. 

Algorithm: 

if $t + 1 < 6$ then 
    $\phi_{t+1} = (t \bmod 2) + 1$. 
else if $\exists i$ such that $T_i(t) < \sqrt{t + 1}$ then 
    $\phi_{t+1} = i$. 
else 
    $\phi_{t+1} = M_{\hat{C}_t}(X_{t+1})$. 
end if 

(Note that Line 1 guarantees that there is only one $i$ such that 
$T_i(t) < \sqrt{t + 1}$.) 
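
The control flow of Algorithm 2 (initial alternation, forced sampling when $T_i(t) < \sqrt{t + 1}$, and otherwise the myopic choice) can be sketched as follows. The sketch assumes the example family $F_\theta(\cdot|x) = \mathcal{N}(\theta x, 1)$ with $\mathcal{X} = \{-1, 1\}$, and it replaces the minimum-Prohorov-distance estimate $\hat{C}_t$ with a per-arm least-squares estimate of $\theta_i$; that substitution and all names are ours, made to keep the example short.

```python
import math
import random

def decide(t, pulls, theta_hat, x_next):
    """phi_{t+1}, given the counts T_i(t), the estimates, and X_{t+1}."""
    if t + 1 < 6:                          # initial alternation between arms
        return (t % 2) + 1
    for arm in (1, 2):                     # forced sampling of a starved arm
        if pulls[arm] < math.sqrt(t + 1):
            return arm
    # Myopic step: play the seemingly better arm under mu_theta(x) = theta * x.
    return 1 if theta_hat[1] * x_next >= theta_hat[2] * x_next else 2

def run(config, horizon, seed=0):
    """Simulate the rule; return the pull counts and T_inf(horizon)."""
    rng = random.Random(seed)
    pulls = {1: 0, 2: 0}
    sxy = {1: 0.0, 2: 0.0}                 # running sums of x * y per arm
    sxx = {1: 0.0, 2: 0.0}                 # running sums of x * x per arm
    t_inf = 0
    for t in range(horizon):               # t rounds are complete at this point
        x = rng.choice((-1, 1))
        theta_hat = {i: (sxy[i] / sxx[i] if sxx[i] > 0 else 0.0) for i in (1, 2)}
        arm = decide(t, pulls, theta_hat, x)
        y = rng.gauss(config[arm - 1] * x, 1.0)
        pulls[arm] += 1
        sxy[arm] += x * y
        sxx[arm] += x * x
        best = 1 if config[0] * x >= config[1] * x else 2
        t_inf += int(arm != best)
    return pulls, t_inf
```

Because the best arm flips with the sign of $X_t$, the myopic step alone keeps sampling both arms at a linear rate, so the forced-sampling branch is rarely taken after the first few rounds.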

VI. Best Arm Is Not A Function Of $X_t$ 

A. Formulation 

Besides the assumption of constant $G$, in this section, we 
consider the case in which for all $C \in \Theta^2$, $M_C(x)$ is not 
a function of $x$, and we thus can use $M_C := M_C(x)$ as 
shorthand notation. Fig. 2 illustrates this situation. 
The needed regularity conditions are similar to those in 
Section V: 

1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all 
$x \in \mathcal{X}$. 

2) $\forall \theta_1, \theta_2, x$, $I(\theta_1, \theta_2|x)$ is strictly 
positive and finite. 

In this case, one arm is always better than the other no 
matter what value of $X_t$ occurs. The conflict between learning 
and control still exists. As expected, the growth rate of 
the expected inferior sampling time is again lower bounded 
by $\log t$, but with the additional help of $X_t$ we can see 
improvements over the traditional bandit problems. 
To greatly simplify the notation, we also assume that 
3) For all $x$, the conditional expected reward $\mu_\theta(x)$ is 
strictly increasing w.r.t. $\theta$. 
This condition gives us the notational convenience that the 
order of $(\mu_{\theta_1}(x), \mu_{\theta_2}(x))$ is simply the same as the 
order of $(\theta_1, \theta_2)$. 

Example: 

• $\Theta = (1, \infty)$, $\mathcal{X} = \{1, 2, 3\}$, and the conditional 
reward distribution $F_\theta(\cdot|x) \sim \mathcal{N}(\theta x, 1)$. 

B. Lower Bound 

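Theorem 5 ($\log t$ Lower Bound): Under the above assumptions, 
for any uniformly good rule $\{\phi_\tau\}$, $T_{\mathrm{inf}}(t)$ 
satisfies 

$$\lim_{t \to \infty} P_{C_0}\Big(T_{\mathrm{inf}}(t) \ge \frac{(1 - \epsilon) \log t}{K_{C_0}}\Big) = 1, \quad \forall \epsilon > 0,$$ 

and 

$$\liminf_{t \to \infty} \frac{E_{C_0}\{T_{\mathrm{inf}}(t)\}}{\log t} \ge \frac{1}{K_{C_0}}, \tag{2}$$ 

where $K_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, 
then $T_{\mathrm{inf}}(t) = T_1(t)$. The constant $K_{C_0}$ can be 
expressed as follows. 

$$K_{C_0} = \inf_{\theta : \theta > \theta_2}\ \sup_{x \in \mathcal{X}} \{I(\theta_1, \theta|x)\}. \tag{3}$$ 

The expression for $K_{C_0}$ for the case in which $M_{C_0} = 1$ can 
be obtained by symmetry. 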

Note 1: If the decision maker is not able to access the side 
observation $X_t$, the player will then face the unconditional 
reward distribution $\int F_{\theta_i}(dy|x) G(dx)$ rather than 
$F_{\theta_i}(dy|x)$. Let $I(\theta_1, \theta_2)$ denote the 
Kullback-Leibler information between the unconditional reward 
distributions. By the convexity of the Kullback-Leibler information, 
we have 

$$\sup_x I(\theta_1, \theta|x) \ge \int I(\theta_1, \theta|x) G(dx) \ge I(\theta_1, \theta).$$ 

This shows that the new constant in front of $\log t$, in (3), 
is no larger than the corresponding constant in (1), and the 
additional side information $X_t$ generally improves the decision 
made in the bandit problem. As we would expect, Theorem 5 
collapses to Theorem 1 when $|\mathcal{X}| = 1$. 

Note 2: This situation is like having several related bandit 
machines, whose reward distributions are all determined by the 
common configuration pair $(\theta_1, \theta_2)$. The information 
obtained from one machine is also applicable to the other machines. 
If arm 2 is always better than arm 1, we wish to sample 
arm 2 most of the time (the control part), and force sample 
arm 1 once in a while (the learning part). With the help of the 
side information $X_t$, we can postpone our forced sampling 
(learning) to the most informative machine $X_t = x^*$. As a 
result, the constant in the $\log t$ lower bound in Theorem 1 has 
been further reduced to this new $1/K_{C_0}$. 

A detailed proof of Theorem 5 is provided in APPENDIX IV. 
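
To illustrate how the inf-sup constant in (3) rewards waiting for the most informative side observation, the sketch below evaluates it for the example family $F_\theta(\cdot|x) = \mathcal{N}(\theta x, 1)$, for which $I(\theta_1, \theta|x) = x^2 (\theta_1 - \theta)^2 / 2$. The finite grid standing in for $\Theta$ and the function names are assumptions made for illustration.

```python
import math

def cond_kl(theta_a, theta_b, x):
    """I(theta_a, theta_b | x) for rewards N(theta * x, 1)."""
    return 0.5 * (x * (theta_a - theta_b)) ** 2

def k_with_side_info(theta1, theta2, thetas, xs):
    """K_{C_0} of (3): inf over theta > theta2 of sup over x of I(theta1, theta | x)."""
    vals = [max(cond_kl(theta1, th, x) for x in xs)
            for th in thetas if th > theta2]
    return min(vals) if vals else math.inf

def most_informative_x(theta1, theta2, thetas, xs):
    """The x maximizing the worst-case information against competitors theta > theta2."""
    competitors = [th for th in thetas if th > theta2]
    if not competitors:
        raise ValueError("no competing parameter above theta2")
    return max(xs, key=lambda x: min(cond_kl(theta1, th, x) for th in competitors))
```

With `thetas = [1.5, 2.0, 2.5, 3.0]` and $(\theta_1, \theta_2) = (1.5, 2.0)$, restricting to the single value $x = 1$ gives a constant of 0.5, while allowing $\mathcal{X} = \{1, 2, 3\}$ raises it to 4.5, attained at $x = 3$. The $\log t$ coefficient $1/K_{C_0}$ therefore shrinks by a factor of nine in this instance.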

C. Scheme Achieving the Lower Bound 

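Consider the additional conditions as follows. 

1) $\Theta$ is finite. 

2) A saddle point for $K_{C_0}$ exists; that is, for all 
$\theta_1 < \theta_2$, 

$$\inf_{\theta : \theta > \theta_2}\ \sup_x I(\theta_1, \theta|x) = \sup_x\ \inf_{\theta : \theta > \theta_2} I(\theta_1, \theta|x).$$ 

With the above conditions, we construct a 
$\log t$-lower-bound-achieving scheme $\{\phi_\tau\}$, which is inspired 
by [12]. The following terms and quantities are necessary in the 
expression of $\{\phi_\tau\}$. 

• Denote $\hat{C}_t := (\theta^\alpha, \theta^\beta)$. Instead of the 
traditional $(\theta_1, \theta_2)$ representation, we use 
$(\theta^\alpha, \theta^\beta)$. Based on this representation, we are 
able to derive the following useful notation: 

$\theta^{\alpha \wedge \beta} := \min(\theta^\alpha, \theta^\beta)$ 

$\alpha \wedge \beta := \arg\min(\theta^\alpha, \theta^\beta)$ 

$\theta^{\alpha \vee \beta} := \max(\theta^\alpha, \theta^\beta)$ 

$\alpha \vee \beta := \arg\max(\theta^\alpha, \theta^\beta)$. 

For instance, if $\theta^\alpha < \theta^\beta$, then 
$\mu_{\theta^\alpha}(x) = \mu_{\theta^{\alpha \wedge \beta}}(x)$; arm 
$\alpha \wedge \beta$ represents arm 1; $Y_t^{\alpha \vee \beta}$ is the 
reward of arm 2; and 
$T_{\mathrm{inf}}(t) = T_{\alpha \wedge \beta}(t) = T_1(t)$. 

• Choose an $\epsilon$ such that 
$0 < \epsilon < \tfrac{1}{2} \min\{\rho(F_\theta(\cdot|x), F_\vartheta(\cdot|x)) : \forall x \in \mathcal{X},\ \theta \ne \vartheta \in \Theta\}$, 
where $\rho$ is the Prohorov metric. The whole system 
is well-sampled if there exists a unique estimate 
$\hat{C}_t = (\theta^\alpha, \theta^\beta)$, such that the empirical 
measure $L_i^x(t)$ falls into the $\epsilon$-neighborhood of 
$F_{i(\hat{C}_t)}(\cdot|x)$, for all $x \in \mathcal{X}$ and 
$i \in \{1, 2\}$. That is, 

$$\exists \hat{C}_t,\ \text{s.t.}\ \rho\big(L_i^x(t), F_{i(\hat{C}_t)}(\cdot|x)\big) < \epsilon, \quad \forall x \in \mathcal{X},\ i \in \{1, 2\}.$$ 

• For any estimate $\hat{C}_t = (\theta^\alpha, \theta^\beta)$, define 
the most informative bandit according to $\hat{C}_t$ as 

$$x^*(\hat{C}_t) := \arg\max_x\ \inf_{\theta : \theta > \theta^{\alpha \vee \beta}} I(\theta^{\alpha \wedge \beta}, \theta|x),$$ 

and $\Lambda_t(\hat{C}_t, \theta)$ to be the conditional likelihood 
ratio between the seemingly inferior arm 
$\theta^{\alpha \wedge \beta}$ and the competing parameter $\theta$: 

$$\Lambda_t(\hat{C}_t, \theta) := \prod_{m=1}^{T_{\alpha \wedge \beta}^{x^*(\hat{C}_t)}(t)} \frac{F_{\theta^{\alpha \wedge \beta}}\big(dY^{\alpha \wedge \beta}_{\tau_{x^*}(m)} \,\big|\, x^*(\hat{C}_t)\big)}{F_\theta\big(dY^{\alpha \wedge \beta}_{\tau_{x^*}(m)} \,\big|\, x^*(\hat{C}_t)\big)},$$ 

where $\tau_{x^*}(m)$ denotes the time instant of the $m$-th pull 
of arm $\alpha \wedge \beta$ when the side observation 
$X_\tau = x^*(\hat{C}_t)$. 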

• Set a total number of $|\mathcal{X}| + |\Theta|^2 + |\Theta|^3$ 
counters, including $|\mathcal{X}|$ counters, named "ctr($x$)"; 
$|\Theta|^2$ counters, named "ctr($C$)" for all possible 
$C \in \Theta^2$; and $|\Theta|^3$ counters, named "ctr($C, \theta$)" for 
all possible $C$ and $\theta$. Initially, all counters are set to zero. 

Algorithm 3 $\phi_{t+1}$, the decision at time $t + 1$ 

 1: if there exists $i \in \{1, 2\}$ and $x \in \mathcal{X}$ such that $T_i^x(t) = 0$, then {Cond0} 
 2:   $\phi_{t+1} \leftarrow (t + 1) \bmod 2$. 
 3: else if the whole system is not well-sampled or $\theta^\alpha = \theta^\beta$, then {Cond1} 
 4:   ctr($X_{t+1}$) $\leftarrow$ ctr($X_{t+1}$) $+ 1$ and $\phi_{t+1} \leftarrow$ ctr($X_{t+1}$) $\bmod 2$. 
 5: else if $\theta^{\alpha \vee \beta} = \bar\theta := \max \Theta$, then {Cond2} 
 6:   $\phi_{t+1} \leftarrow 1$ if $\theta^\alpha = \bar\theta$. Otherwise, $\phi_{t+1} \leftarrow 2$. 
 7: else {Cond3} 
 8:   ctr($\hat{C}_t$) $\leftarrow$ ctr($\hat{C}_t$) $+ 1$. 
 9:   if ctr($\hat{C}_t$) is odd, then {Cond3a} 
10:     $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
11:   else {Cond3b} 
12:     $\theta^* \leftarrow \arg\min\{\Lambda_t(\hat{C}_t, \theta) : \theta > \theta^{\alpha \vee \beta}\}$. 
13:     if $X_{t+1} = x^*(\hat{C}_t)$ then {Cond3b1} 
14:       if $\Lambda_t(\hat{C}_t, \theta^*) \le t (\log t)^2$, then {Cond3b1a} 
15:         ctr($\hat{C}_t, \theta^*$) $\leftarrow$ ctr($\hat{C}_t, \theta^*$) $+ 1$. 
16:         if $\exists k \in \mathbb{N}$ s.t. ctr($\hat{C}_t, \theta^*$) $= k^2$, then {Cond3b1a1} 
17:           $\phi_{t+1} \leftarrow k \bmod 2$. 
18:         else {Cond3b1a2} 
19:           $\phi_{t+1} \leftarrow 3 - M_{\hat{C}_t}(X_{t+1})$. 
20:         end if 
21:       else {Cond3b1b} 
22:         $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
23:       end if 
24:     else {Cond3b2} 
25:       $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
26:     end if 
27:   end if 
28: end if 



Theorem 6 (Asymptotic Tightness): With the above conditions, 
the scheme described in Algorithm 3 achieves the $\log t$ 
lower bound (2), so that this $\{\phi_\tau\}$ is uniformly good and 
asymptotically optimal. 

A complete analysis is provided in APPENDIX V. 
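
The likelihood-ratio test at the heart of Algorithm 3 (condition Cond3b1a, which compares the smallest conditional likelihood ratio against $t (\log t)^2$) can be sketched in the log domain. The sketch assumes the example family $F_\theta(\cdot|x) = \mathcal{N}(\theta x, 1)$ and requires $t > 1$; the function names and the argument layout are hypothetical and chosen only for illustration.

```python
import math

def log_likelihood_ratio(theta_inf, theta, x_star, rewards):
    """Log of the conditional likelihood ratio for Gaussian rewards N(theta * x, 1).

    `rewards` are the rewards of the seemingly inferior arm (parameter
    theta_inf) collected at times when the side observation equalled x_star;
    `theta` is the competing parameter.
    """
    total = 0.0
    for y in rewards:
        total += -0.5 * (y - theta_inf * x_star) ** 2 + 0.5 * (y - theta * x_star) ** 2
    return total

def needs_more_exploration(theta_inf, competitors, x_star, rewards, t):
    """Cond3b1a in the log domain: min over competitors <= log(t (log t)^2)."""
    log_min = min(log_likelihood_ratio(theta_inf, th, x_star, rewards)
                  for th in competitors)
    threshold = math.log(t) + 2.0 * math.log(math.log(t))
    return log_min <= threshold
```

While the ratio stays below the threshold, the evidence against the closest competing parameter is still weak and the scheme keeps exploring the seemingly inferior arm at the most informative side observation. Once the ratio exceeds the threshold, the scheme reverts to the myopic choice.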

VII. Mixed case 

The main difference between Sections V and VI is that in 
one case, for all possible $C_0$, $X_t$ always changes the preference 
order, while in the other, for all possible $C_0$, $X_t$ never changes 
the order. A more general case is a mixture of these two. In 
this section, we consider this mixed case, which is the main 
result of this paper. 

A. Formulation 

Besides the assumption of constant $G$, in this section, we 
consider the case in which for some $C \in \Theta^2$, $M_C(x)$ is not a 
function of $x$. For the remaining $C$, there exist $x_1$ and $x_2$ s.t. 
$M_C(x_1) = 1$ and $M_C(x_2) = 2$. For future reference, when 
the configuration pair $C_0$ satisfies the latter case, we say the 
configuration pair $C_0$ is implicitly revealing. Fig. 3 illustrates 
this situation. 

However, without knowledge of the authentic underlying 
configuration $C_0$, we do not know whether $C_0$ is implicitly 
revealing or not. In view of the results of Sections V and VI, 
we would like to find a single scheme that is able to achieve 
bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$ when being applied to an implicitly 
revealing $C_0$, and on the other hand to achieve the $\log t$ lower 
bound when being applied to those $C_0$ which are not implicitly 
revealing. 

The needed regularity conditions are the same as those in 
Sections V and VI: 




Fig. 3. If $(\theta_1, \theta_2) = (\theta_a, \theta_b)$, the best arm 
depends on $x$, i.e. $\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$ intersect 
each other as in Section V. If $(\theta_1, \theta_2) = (\theta_b, \theta_c)$, 
the best arm does not depend on $x$, i.e. $\mu_{\theta_1}(x)$ and 
$\mu_{\theta_2}(x)$ do not intersect each other as in Section VI. 

1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all 
$x \in \mathcal{X}$. 

2) $\forall \theta_1, \theta_2, x$, $I(\theta_1, \theta_2|x)$ is strictly 
positive and finite. 

To simplify the notation and the following proof, we define a 
partial ordering as $\theta \prec \vartheta$ iff $\forall x$, 
$\mu_\theta(x) < \mu_\vartheta(x)$, and $\theta \succ \vartheta$ is 
defined similarly. Note that for a configuration 
$C_0 = (\theta_1, \theta_2)$, it can be the case that neither 
$\theta_1 \prec \theta_2$ nor $\theta_1 \succ \theta_2$. 

Example: 

• $\Theta = (0, \infty)$, $\mathcal{X} = \{-1, 1\}$ and the conditional 
reward distribution $F_\theta(\cdot|x) \sim \mathcal{N}(\theta^2 - \theta x, 1)$. 
Then $C = (\theta_1, \theta_2) = (0.1, 0.2)$ is implicitly revealing, but 
$C = (0, 10)$ is not. 
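
The distinction drawn in the example can be checked numerically. The sketch below assumes that the conditional mean in the example reads $\mu_\theta(x) = \theta^2 - \theta x$ (our reading of the formula) and tests whether the preferred arm changes with $x$ over $\mathcal{X} = \{-1, 1\}$; the function names are hypothetical.

```python
def mean_reward(theta, x):
    """Assumed conditional mean mu_theta(x) = theta**2 - theta * x."""
    return theta ** 2 - theta * x

def preferred_arm(config, x):
    """M_C(x): 1 if arm 1 has the strictly larger conditional mean, else 2."""
    return 1 if mean_reward(config[0], x) > mean_reward(config[1], x) else 2

def is_implicitly_revealing(config, xs):
    """True if both arms are preferred for some value of the side observation."""
    return len({preferred_arm(config, x) for x in xs}) == 2
```

Under this assumption, the pair $(0.1, 0.2)$ prefers arm 2 at $x = -1$ (means 0.11 versus 0.24) and arm 1 at $x = 1$ (means $-0.09$ versus $-0.16$), so it is implicitly revealing. The pair $(0, 10)$ prefers arm 2 for both values of $x$, so it is not.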

B. Lower Bound 

Theorem 7 ($\log t$ Lower Bound): Under the above assumptions, 
for any uniformly good rule $\{\phi_\tau\}$, if $C_0$ is not implicitly 
revealing, $T_{\mathrm{inf}}(t)$ satisfies 

$$\lim_{t \to \infty} P_{C_0}\Big(T_{\mathrm{inf}}(t) \ge \frac{(1 - \epsilon) \log t}{K_{C_0}}\Big) = 1, \quad \forall \epsilon > 0,$$ 

and 

$$\liminf_{t \to \infty} \frac{E_{C_0}\{T_{\mathrm{inf}}(t)\}}{\log t} \ge \frac{1}{K_{C_0}}, \tag{4}$$ 

where $K_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, 
$T_{\mathrm{inf}}(t) = T_1(t)$, and the constant $K_{C_0}$ can be 
expressed as follows. 

$$K_{C_0} = \inf_{\{\theta : \exists x_0,\ \text{s.t.}\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)\}}\ \sup_x \{I(\theta_1, \theta|x)\}.$$ 

The expression for $K_{C_0}$ for the case in which $M_{C_0} = 1$ can 
be obtained by symmetry. 

The only difference between the lower bounds (2) and (4) is 
that, in (4), $K_{C_0}$ has been changed from taking the infimum 
over $\{\theta > \theta_2\} = \{\theta : \forall x, \mu_\theta(x) > \mu_{\theta_2}(x)\}$ 
to a larger set, $\{\theta : \exists x, \mu_\theta(x) > \mu_{\theta_2}(x)\}$. 
The reason for this is as follows. Consider a $\theta$ for which there 
exists $x$ such that $\mu_\theta(x) > \mu_{\theta_2}(x)$. If the authentic 
configuration is $C' = (\theta, \theta_2)$ rather than 
$(\theta_1, \theta_2)$, a linear order of incorrect sampling will be 
introduced, which violates the uniformly-good-rule assumption. 
As a result, a broader class of competing distributions 
$C' = (\theta, \theta_2)$ must be considered, i.e., we must consider a 
different set of configurations, over which the infimum is 
taken. 

A detailed proof is contained in APPENDIX VI. 

C. Scheme Achieving the Lower Bound 

Consider the same two additional conditions as those in 
Section VI. 

1) $\Theta$ is finite. 

2) A saddle point for $K_{C_0}$ exists; that is, for all $\theta_1$, 

$$\inf_{\{\theta : \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)\}}\ \sup_x I(\theta_1, \theta|x) = \sup_x\ \inf_{\{\theta : \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)\}} I(\theta_1, \theta|x).$$ 

Algorithm 4 $\phi_{t+1}$, the decision at time $t + 1$ 

 1: if there exists $i \in \{1, 2\}$ and $x \in \mathcal{X}$ such that $T_i^x(t) = 0$, then {Cond0} 
 2:   $\phi_{t+1} \leftarrow (t + 1) \bmod 2$. 
 3: else if the whole system is not well-sampled or $\theta^\alpha = \theta^\beta$, then {Cond1} 
 4:   ctr($X_{t+1}$) $\leftarrow$ ctr($X_{t+1}$) $+ 1$ and $\phi_{t+1} \leftarrow$ ctr($X_{t+1}$) $\bmod 2$. 
 5: else if there exists $i \in \{1, 2\}$, such that $\forall \theta, x$, $\mu_\theta(x) \le \mu_{i(\hat{C}_t)}(x)$, then {Cond2} 
 6:   $\phi_{t+1} \leftarrow i$, where $i$ is the satisfying index. 
 7: else if $\hat{C}_t$ is implicitly revealing, then {Cond2.5} 
 8:   $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
 9: else {Cond3} 
10:   ctr($\hat{C}_t$) $\leftarrow$ ctr($\hat{C}_t$) $+ 1$. 
11:   if ctr($\hat{C}_t$) is odd, then {Cond3a} 
12:     $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
13:   else {Cond3b} 
14:     $\theta^* \leftarrow \arg\min\{\Lambda_t(\hat{C}_t, \theta) : \forall \theta, \exists x_0,\ \text{s.t.}\ \mu_\theta(x_0) > \mu_{\theta^{\alpha \vee \beta}}(x_0)\}$. 
15:     if $X_{t+1} = x^*(\hat{C}_t)$ then {Cond3b1} 
16:       if $\Lambda_t(\hat{C}_t, \theta^*) \le t (\log t)^2$, then {Cond3b1a} 
17:         ctr($\hat{C}_t, \theta^*$) $\leftarrow$ ctr($\hat{C}_t, \theta^*$) $+ 1$. 
18:         if $\exists k \in \mathbb{N}$ s.t. ctr($\hat{C}_t, \theta^*$) $= k^2$, then {Cond3b1a1} 
19:           $\phi_{t+1} \leftarrow k \bmod 2$. 
20:         else {Cond3b1a2} 
21:           $\phi_{t+1} \leftarrow 3 - M_{\hat{C}_t}(X_{t+1})$. 
22:         end if 
23:       else {Cond3b1b} 
24:         $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
25:       end if 
26:     else {Cond3b2} 
27:       $\phi_{t+1} \leftarrow M_{\hat{C}_t}(X_{t+1})$. 
28:     end if 
29:   end if 
30: end if 



A proposed scheme is described in Algorithm 4, which is 
similar to the scheme in Section VI-C. The only differences 
are the insertion of Cond2.5, Lines 7 and 8; the modification 
of Cond2, Lines 5 and 6; and the modification of Cond3b, 
line 14. 
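
The nesting of the conditions in Algorithm 4 can be sketched in code. The following is only an illustrative sketch, not the authors' implementation: the `Flags` record and all of its field names are hypothetical summaries of the quantities tested in Lines 1 through 18, and the function returns the label of the branch that fires rather than the arm to pull.

```python
from dataclasses import dataclass
from math import isqrt


@dataclass
class Flags:
    """Hypothetical summaries of the state before deciding phi_{t+1}."""
    unsampled_pair: bool        # Line 1: some (arm, x) pair was never sampled
    undersampled_or_tie: bool   # Line 3: not well-sampled, or both estimates coincide
    dominant_arm: bool          # Line 5: one estimate dominates every theta at every x
    implicitly_revealing: bool  # Line 7: the estimate is implicitly revealing
    estimate_counter: int       # ctr(C_t) after the increment in Line 10
    at_informative_x: bool      # Line 15: X_{t+1} equals x*(C_t)
    ratio_small: bool           # Line 16: Lambda_t(C_t, theta*) <= t (log t)^2
    pair_counter: int           # ctr(C_t, theta*) after the increment in Line 17


def branch(f: Flags) -> str:
    """Return the label of the condition that Algorithm 4 would enter."""
    if f.unsampled_pair:
        return "Cond0"
    if f.undersampled_or_tie:
        return "Cond1"
    if f.dominant_arm:
        return "Cond2"
    if f.implicitly_revealing:
        return "Cond2.5"
    if f.estimate_counter % 2 == 1:
        return "Cond3a"
    if not f.at_informative_x:
        return "Cond3b2"
    if not f.ratio_small:
        return "Cond3b1b"
    # Forced sampling of the seemingly inferior arm, except on perfect squares.
    k = isqrt(f.pair_counter)
    return "Cond3b1a1" if k * k == f.pair_counter else "Cond3b1a2"
```

The early returns mirror the `else if` chain: Cond2.5 is reached only after Cond0 through Cond2 have failed, and the Cond3 sub-branches are reached only when the estimate is not implicitly revealing.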
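Notes:

1) When the estimate $\hat C_t = (\theta^\alpha, \theta^\beta)$ is not implicitly
revealing, an ordering between $\theta^\alpha$ and $\theta^\beta$ exists. As a
result, all notation regarding $\alpha \vee \beta$, $\theta^{\alpha\vee\beta}$, etc., remains
valid.

2) The definition of $\Lambda_t(\hat C_t, \theta)$ is slightly different. For any
estimate $\hat C_t = (\theta^\alpha, \theta^\beta)$ that is not implicitly revealing,
we can define the most informative bandit according to
$\hat C_t$ as

$$x^*(\hat C_t) := \arg\max_x \inf_{\{\theta : \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta^{\alpha\vee\beta}}(x_0)\}} I(\theta^{\alpha\wedge\beta}, \theta | x), \tag{5}$$

and $\Lambda_t(\hat C_t, \theta)$ to be the conditional likelihood ratio
between the seemingly inferior arm $\theta^{\alpha\wedge\beta}$ and the competing
parameter $\theta$. That is,

$$\Lambda_t(\hat C_t, \theta) := \prod_{m=1}^{T_{\alpha\wedge\beta}^{x^*}(t)} \frac{F_{\theta^{\alpha\wedge\beta}}\left(dY^{\alpha\wedge\beta}_{\tau_{x^*}(m)} \,\middle|\, x^*(\hat C_t)\right)}{F_{\theta}\left(dY^{\alpha\wedge\beta}_{\tau_{x^*}(m)} \,\middle|\, x^*(\hat C_t)\right)},$$

where $\tau_{x^*}(m)$ denotes the time instant of the $m$-th pull
of arm $\alpha \wedge \beta$ when the side observation $X_\tau = x^*(\hat C_\tau)$.
(The difference between this new $\Lambda_t(\hat C_t, \theta)$ and the
previous one in Algorithm 3 is that we have a new
$x^*(\hat C_t)$ defined in (5).)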
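
For finite $\Theta$ and finite $X$, the max-min in (5) is a plain search over two small sets. The sketch below is an illustration under assumptions that are not from the paper: rewards are taken to be Bernoulli, so that $I(\theta_1, \theta|x)$ is the Bernoulli Kullback-Leibler divergence between the conditional means, and the table `mu`, the function names, and the parameter labels are all hypothetical.

```python
from math import log


def bernoulli_kl(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))


def most_informative_x(mu, theta_inferior, theta_superior, thetas, xs):
    """Return the x maximizing the worst-case information, as in (5).

    mu[theta][x] is the (assumed Bernoulli) conditional mean reward.
    The competing set holds every theta that beats theta_superior at some x;
    it is assumed to be non-empty.
    """
    competing = [th for th in thetas
                 if any(mu[th][x] > mu[theta_superior][x] for x in xs)]

    def worst_case_information(x):
        return min(bernoulli_kl(mu[theta_inferior][x], mu[th][x])
                   for th in competing)

    return max(xs, key=worst_case_information)
```

With a hypothetical table in which two competing parameters are each easy to distinguish at one value of $x$ and hard at the other, the function picks the $x$ whose hardest competitor is still the most distinguishable, which is the sense in which that sub-bandit is "most informative".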
Theorem 8 (Asymptotic Tightness): With the above conditions,
the scheme described in Algorithm 4 has bounded
$\lim_{t\to\infty} E_{C_0}\{T_{\mathrm{inf}}(t)\}$, or achieves the $\log t$ lower bound (4),
depending on whether the underlying configuration pair $C_0$
is implicitly revealing or not.

A detailed analysis is given in APPENDIX VI.

VIII. Conclusion 

We have shown that observing additional side information 
can significantly improve sequential decisions in bandit prob- 
lems. If the side observation itself directly provides informa- 
tion about the underlying configuration, then it resolves the 
dilemma of forced sampling and optimal control. The expected 
inferior sampling time will be bounded, as shown in Section 
IV. If the side observation does not provide information on
the underlying configuration $(\theta_1, \theta_2)$, but always affects the
preference order (implicitly revealing), then the myopic
approach of sampling the seemingly-best arm will automatically
sample both arms enough. The expected inferior sampling time
is bounded, as shown in Section V. If the side observation
does not affect the preference order at all, the dilemma still
exists. However, by postponing our forced sampling to the
most informative time instants, we can reduce the constant in
the $\log t$ lower bound, as shown in Section VI. In Section VII,
we combined the settings of Sections V and VI, and obtained
a general result. When the underlying configuration $C_0$ is
implicitly revealing (such that $X_t$ will change the preference
order), we obtain bounded expected inferior sampling time
as in Section V. Even if $C_0$ is not implicitly revealing (in
that $X_t$ does not change the preference order), the new $\log t$
lower bound can be achieved as in Section VI. Our results are
summarized in TABLE II.

Appendix I 

Sanov's theorem and the Prohorov metric 

For two distributions $P$ and $Q$ on the reals, the Prohorov
metric is defined as follows.

Definition 2 (The Prohorov metric): For any closed set
$A \subset \mathbb{R}$ and $\epsilon > 0$, define $A^\epsilon$, the $\epsilon$-flattening of $A$, as

$$A^\epsilon := \left\{ x \in \mathbb{R} : \inf_{y \in A} |x - y| < \epsilon \right\}.$$

The Prohorov metric $\rho$ is then defined as follows.

$$\rho(P, Q) := \inf\{\epsilon > 0 : P(A) \le Q(A^\epsilon) + \epsilon, \text{ for all closed } A \subset \mathbb{R}\}.$$
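TABLE II

Summary of the Benefit of the Side Observations and the Required Regularity Conditions

Row 1
  Characterization: $G_{C_1} \ne G_{C_2}$ iff $C_1 \ne C_2$.
  Regularity Conditions: [...]
  Results: $\exists \{\phi_\tau\}$ s.t. $\forall C_0$, $\lim_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty$.

Row 2
  Characterization:
    (i) Constant $G_C$, i.e., $G_C := G$.
    (ii) $\forall C$, $\exists x_1, x_2$, s.t. $M_C(x_1) = 1$, $M_C(x_2) = 2$ (implicitly revealing).
  Regularity Conditions:
    (i) $X$ is finite.
    (ii) $\forall \theta_1 \ne \theta_2, x$, $0 < I(\theta_1, \theta_2 | x) < \infty$.
    (iii) $\forall x$, $\mu_\theta(x)$ is continuous w.r.t. $\theta$.
  Results: $\exists \{\phi_\tau\}$ such that $\forall C_0$, $\lim_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty$.

Row 3
  Characterization:
    (i) Constant $G_C$, i.e., $G_C := G$.
    (ii) $\forall C$, $M_C(x)$ only depends on $C$, not on $x$.
  Regularity Conditions:
    (i) $X$ is finite.
    (ii) $\forall \theta_1 \ne \theta_2, x$, $0 < I(\theta_1, \theta_2 | x) < \infty$.
    (iii) $\forall x$, $\mu_\theta(x)$ is strictly increasing w.r.t. $\theta$.
  Results: For any uniformly good $\{\phi_\tau\}$, we have
    $\liminf_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} / \log t \ge 1 / K_{C_0}$,
    where $K_{C_0} := \inf_\theta \sup_x I(\theta_1, \theta | x)$.
    For finite $\Theta$, $\exists \{\phi_\tau\}$ s.t. $\limsup_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} / \log t \le 1 / K_{C_0}$.

Row 4
  Characterization:
    (i) Constant $G_C$, i.e., $G_C := G$.
    (ii) The underlying $C_0$ may be implicitly revealing or not.
  Regularity Conditions:
    (i) $X$ is finite.
    (ii) $\forall \theta_1 \ne \theta_2, x$, $0 < I(\theta_1, \theta_2 | x) < \infty$.
  Results: For any uniformly good $\{\phi_\tau\}$, if $C_0$ is not implicitly revealing, we have
    $\liminf_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} / \log t \ge 1 / K_{C_0}$,
    where $K_{C_0} := \inf_\theta \sup_x I(\theta_1, \theta | x)$.
    For finite $\Theta$, $\exists \{\phi_\tau\}$ s.t.
    (1) if $C_0$ is implicitly revealing, $\lim_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} < \infty$,
    (2) if $C_0$ is not implicitly revealing, $\limsup_{t} E_{C_0}\{T_{\mathrm{inf}}(t)\} / \log t \le 1 / K_{C_0}$.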

The Prohorov metric generates the topology corresponding 
to convergence in distribution. Throughout this paper, the 
open/closed sets on the space of distributions are thus defined 
accordingly. 

Theorem 9 (Sanov's theorem): Let $L_X(n)$ denote the
empirical measure of the real-valued i.i.d. random variables
$X_1, X_2, \cdots, X_n$. Suppose $X_i$ is of distribution $P$ and consider
any open set $A$ and closed set $B$ from the topological space
of distributions, generated by the Prohorov metric. We have

$$\liminf_{n\to\infty} \frac{1}{n} \log P_P(L_X(n) \in A) \ge -\inf_{Q \in A} I(Q, P),$$

$$\limsup_{n\to\infty} \frac{1}{n} \log P_P(L_X(n) \in B) \le -\inf_{Q \in B} I(Q, P).$$

Further discussion of the Prohorov metric and Sanov's theorem 
can be found in [23], [24]. 
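
As a numerical illustration of the exponential decay in Theorem 9, consider the special case of Bernoulli($p$) samples and the closed set $B$ of distributions with mean at least $a > p$, for which $\inf_{Q \in B} I(Q, P)$ is the Bernoulli divergence between $a$ and $p$. The sketch below computes the exact binomial tail in log-space and compares the resulting rate with that divergence; the function names and the chosen values of $n$, $p$, and $a$ are illustrative assumptions, not taken from the paper.

```python
from math import ceil, exp, lgamma, log


def bernoulli_kl(a, p):
    """Divergence I(Q, P) for Q = Bernoulli(a) and P = Bernoulli(p)."""
    return a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))


def log_binomial_pmf(n, k, p):
    """log P(S_n = k) for S_n ~ Binomial(n, p), computed via log-gamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def empirical_rate(n, p, a):
    """Return -(1/n) log P(empirical mean >= a), using log-sum-exp."""
    k_min = ceil(a * n - 1e-12)
    logs = [log_binomial_pmf(n, k, p) for k in range(k_min, n + 1)]
    top = max(logs)
    log_tail = top + log(sum(exp(value - top) for value in logs))
    return -log_tail / n
```

For example, `empirical_rate(4000, 0.3, 0.5)` stays above `bernoulli_kl(0.5, 0.3)` and approaches it as `n` grows, which is the behavior the upper bound in Theorem 9 describes for a closed set.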

Appendix II 
PROOF OF Theorem 3 

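Proof: For any underlying configuration pair $C_0 = (\theta_1, \theta_2)$,
define the error set $\mathcal{C}_e$ as follows.

$$\mathcal{C}_e := \bigcup_{x \in X} \{C \in \Theta^2 : M_C(x) \ne M_{C_0}(x)\}. \tag{6}$$

Let $\bar{\mathcal{C}}_e$ denote the closure of $\mathcal{C}_e$. By Condition 1, $C_0 \notin \bar{\mathcal{C}}_e$.
For any $t$, we can write

$$\begin{aligned}
P_{C_0}(\phi_t \ne M_{C_0}(X_t)) &= P_{C_0}(M_{\hat C_t}(X_t) \ne M_{C_0}(X_t)) \\
&\le P_{C_0}(\exists x,\ M_{\hat C_t}(x) \ne M_{C_0}(x)) \\
&= P_{C_0}(\hat C_t \in \mathcal{C}_e) \\
&\le P_{C_0}(\hat C_t \in \bar{\mathcal{C}}_e).
\end{aligned}$$

Let $\epsilon = \frac{1}{3} \inf\{\rho(G_{C_0}, G_{C_e}) : C_e \in \bar{\mathcal{C}}_e\}$, which is strictly
positive by Condition 1, and consider sufficiently large $t > 1/\epsilon$.
If $\rho(G_{C_0}, L_X(t)) < \epsilon$, then by the definition of $\hat C_t$,
$\rho(G_{\hat C_t}, L_X(t)) < \epsilon + \epsilon = 2\epsilon$. By the triangle inequality,
$\rho(G_{C_0}, G_{\hat C_t}) < 3\epsilon$ and $\hat C_t \notin \bar{\mathcal{C}}_e$. As a result,

$$\{\hat C_t \in \bar{\mathcal{C}}_e\} \subset \{\rho(L_X(t), G_{C_0}) \ge \epsilon\} =: K_t.$$

$K_t$ is a closed set. By Sanov's theorem, the probability
of $K_t$ is exponentially upper bounded w.r.t. $t$, and so is
$P_{C_0}(\hat C_t \in \bar{\mathcal{C}}_e)$. As a result, we have

$$\lim_{t\to\infty} E_{C_0}\{T_{\mathrm{inf}}(t)\} = \lim_{t\to\infty} \sum_{\tau=1}^{t} P_{C_0}(\phi_\tau \ne M_{C_0}(X_\tau)) < \infty.$$

By the monotone convergence theorem, the expectation of
$\lim_{t\to\infty} T_{\mathrm{inf}}(t)$ is finite, which implies that $\lim_{t\to\infty} T_{\mathrm{inf}}(t)$
is finite a.s. ■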

Appendix III 
PROOF OF Theorem 4 

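Similarly, we define $\mathcal{C}_e$ as that in (6). We need the following
lemma to complete the analysis.

Lemma 1: With the regularity conditions specified in Section V,
$\exists a_1, a_2 > 0$ such that
$P_{C_0}(\hat C_t \in \mathcal{C}_e) \le a_1 \exp(-a_2 \min\{T_1^{x^*}(t), T_2^{x^*}(t)\})$.

Proof of Lemma 1: By the continuity of $\mu_\theta(x)$ w.r.t. $\theta$ and
the assumption of finite $X$, it can be shown that $C_0 \in (\bar{\mathcal{C}}_e)^c$.^5
Therefore there exists a neighborhood of $C_0$,
$\mathcal{C}_\delta = (\theta_1 - \delta, \theta_1 + \delta) \times (\theta_2 - \delta, \theta_2 + \delta)$, such that $\mathcal{C}_e \subset \mathcal{C}_\delta^c$.

Define a strictly positive $\epsilon > 0$ as follows.

$$\epsilon := \frac{1}{4} \inf\{\rho(F_{\vartheta_i}(\cdot|x), F_{\theta_i}(\cdot|x)) : \forall x \in X,\ i \in \{1, 2\},\ (\vartheta_1, \vartheta_2) \in \mathcal{C}_\delta^c\}.$$

We would like to prove that for sufficiently large $t > 1/\epsilon$,
$\{\hat C_t \in \mathcal{C}_\delta^c\} \subset \{\exists i,\ \rho(L_i^{x^*}(t), F_{\theta_i}(\cdot|x^*)) > \epsilon\}$.

^5 $\mathcal{C}_e^c$ denotes the complement of $\mathcal{C}_e$.

Suppose $\rho(L_i^{x^*}(t), F_{\theta_i}(\cdot|x^*)) \le \epsilon$ for both $i = 1, 2$. By the
definition of $\sigma(C_0, t)$, we have

$$\sigma(C_0, t) < 2\epsilon. \tag{7}$$

However, for those $\hat C_t \in \mathcal{C}_\delta^c$, by the definition of $\epsilon$, for some
$i \in \{1, 2\}$, we have

$$\sigma(\hat C_t, t) \ge \rho(F_{\hat\theta_i}(\cdot|x^*), L_i^{x^*}(t)) > 3\epsilon, \tag{8}$$

which contradicts the definition of $\hat C_t$ since (7) and (8) imply
$\sigma(\hat C_t, t) > \frac{1}{t} + \sigma(C_0, t)$. As a result, for sufficiently large $t$,
we have

$$\begin{aligned}
\{\hat C_t \in \mathcal{C}_e\} &\subset \{\hat C_t \in \mathcal{C}_\delta^c\} \\
&\subset \{\exists i,\ \rho(L_i^{x^*}(t), F_{\theta_i}(\cdot|x^*)) > \epsilon\} \\
&= \bigcup_{i=1,2} \{\rho(L_i^{x^*}(t), F_{\theta_i}(\cdot|x^*)) > \epsilon\}. 
\end{aligned} \tag{9}$$

By Sanov's theorem, the probability of each term in the union
on the right-hand side of (9) is exponentially bounded w.r.t.
$T_i^{x^*}(t)$. As a result, the probability of this finite union is
bounded by $a_1 \exp(-a_2 \min\{T_1^{x^*}(t), T_2^{x^*}(t)\})$ for some $a_1$,
$a_2 > 0$. ■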
Analysis of the scheme: We first use induction to show
that $\forall t \ge 6$, $T_i(t) \ge \sqrt{t}$. This statement is true for $t = 6$.
Suppose $T_i(t - 1) \ge \sqrt{t - 1}$. If $T_i(t - 1) \ge \sqrt{t}$, by the
monotonicity of $T_i(t)$ w.r.t. $t$, we have $T_i(t) \ge T_i(t - 1) \ge \sqrt{t}$.
If $T_i(t - 1) < \sqrt{t}$, by the forced sampling mechanism,

$$T_i(t) = T_i(t - 1) + 1 \ge \sqrt{t - 1} + 1 \ge \sqrt{t}.$$
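
The induction can be checked numerically. The sketch below uses a hypothetical forced-sampling rule, which is an assumption made for illustration and not necessarily the rule of Section V: before the pull at time $t+1$, any arm with $T_i(t) < \sqrt{t+1}$ is forced (the least-sampled one if both qualify), and otherwise a fixed "greedy" arm is pulled. Comparisons are done on squared integers to avoid floating-point square roots.

```python
def forced_sampling_counts(horizon, greedy_arm=2):
    """Return [(T_1(t), T_2(t)) for t = 1..horizon] under the assumed rule."""
    counts = {1: 0, 2: 0}
    history = []
    for t in range(horizon):  # t pulls so far; now deciding pull t + 1
        # An arm is starved when T_i(t) < sqrt(t + 1), i.e. T_i(t)^2 < t + 1.
        starved = [i for i in (1, 2) if counts[i] ** 2 < t + 1]
        if starved:
            arm = min(starved, key=lambda i: counts[i])  # force least-sampled arm
        else:
            arm = greedy_arm  # stand-in for the myopic choice
        counts[arm] += 1
        history.append((counts[1], counts[2]))
    return history


def invariant_holds(history, start=6):
    """Check min_i T_i(t) >= sqrt(t) for every t >= start."""
    return all(min(pair) ** 2 >= t
               for t, pair in enumerate(history, start=1) if t >= start)
```

Under this assumed rule, with the greedy arm fixed to arm 2, the invariant fails at a few early times and holds from $t = 6$ on, in line with the base case used in the induction.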

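We consider the event of inferior sampling at time $(t + 1)$:

$$\begin{aligned}
\{\phi_{t+1} \ne M_{C_0}(X_{t+1})\} &= \{\phi_{t+1} \ne M_{C_0}(X_{t+1}),\ \hat C_t \in \mathcal{C}_e\} \cup \{\phi_{t+1} \ne M_{C_0}(X_{t+1}),\ \hat C_t \in \mathcal{C}_e^c\} \\
&\subset \{\hat C_t \in \mathcal{C}_e\} \cup \{\phi_{t+1} \ne M_{C_0}(X_{t+1}),\ \hat C_t \in \mathcal{C}_e^c\} \\
&=: A_{t+1} \cup B_{t+1}.
\end{aligned} \tag{10}$$

Since $T_i(t) \ge \sqrt{t}$, we have $\min_i T_i(t) \ge \sqrt{t}$, which in turn
gives a lower bound on $\min_i T_i^{x^*}(t)$. By Lemma 1, we have
$P_{C_0}(A_{t+1}) \le a_1 \exp(-a_2 \min_i T_i^{x^*}(t))$, and hence
$\sum_{t+1=7}^{\infty} P_{C_0}(A_{t+1}) < \infty$.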
For $P_{C_0}(B_{t+1})$, we can write

$$\begin{aligned}
B_{t+1} &= \{\phi_{t+1} \ne M_{\hat C_t}(X_{t+1}),\ \hat C_t \in \mathcal{C}_e^c\} \\
&\subset \{\min\{T_i(t)\}_i < \sqrt{t + 1},\ \hat C_t \in \mathcal{C}_e^c\} \\
&= \{\min\{T_i(t)\}_i = \sqrt{t} \in \mathbb{N},\ \hat C_t \in \mathcal{C}_e^c\} \\
&\subset \{\exists i,\ \phi_a \ne i,\ \forall a \in (\tau_0, t],\ T_i(\tau_0) = \sqrt{t}\} \\
&= B^1_{t+1} \cup B^2_{t+1},
\end{aligned} \tag{11}$$

where $\tau_0 = (\sqrt{t} - 1)^2 + 1$ and $B^1_{t+1}$, $B^2_{t+1}$ correspond to $i = 1, 2$,
respectively. The first equality comes from the fact that
since $\hat C_t \in \mathcal{C}_e^c$, $M_{C_0}(X_{t+1}) = M_{\hat C_t}(X_{t+1})$. The first subset
sign comes from the fact that $\phi_{t+1} \ne M_{\hat C_t}(X_{t+1})$ implies
the decision rule $\phi_{t+1}$ is in the stage of forced sampling. The
second equality follows by combining both the inequalities
$\min\{T_i(t)\}_i \ge \sqrt{t}$ and $\min\{T_i(t)\}_i < \sqrt{t + 1}$ and the fact
that both $t$ and $T_i(t)$ are integers.

The reasoning behind the second subset sign is as
follows. By again using the fact that $T_i(t) \ge \sqrt{t}$ and
substituting $\tau_0$ for $t$, we have $\sqrt{t} - 1 < T_i(\tau_0)$ and thus have
$T_i(\tau_0) = \sqrt{t} = T_i(t)$, which guarantees that arm $i$ has not
been sampled from time $\tau_0 + 1$ to $t$.

By the symmetry between $B^1_{t+1}$ and $B^2_{t+1}$, we can consider
only $B^1_{t+1}$ for example. We have

$$\begin{aligned}
P_{C_0}(B^1_{t+1}) &\le P_{C_0}(M_{\hat C_{a-1}}(X_a) = 2,\ \forall a \in (\tau_0, t]) \\
&= \prod_{a=\tau_0+1}^{t} P_{C_0}\left(M_{\hat C_{a-1}}(X_a) = 2 \,\middle|\, M_{\hat C_{b-1}}(X_b) = 2,\ \forall b \in (\tau_0, a)\right) \\
&\le \left(1 - \min_x\{P_G(X_t = x)\}\right)^{t - \tau_0}.
\end{aligned} \tag{12}$$

The first inequality comes from the definition of $\{\phi_\tau\}$, which
implies that if $T_1(\tau_0) = T_1(t) \ge \sqrt{t}$, the forced sampling
mechanism is not active during the time interval $(\tau_0, t]$. So
$\phi_a = 2$ implies $M_{\hat C_{a-1}}(X_a) = 2$, $\forall a \in (\tau_0, t]$. The second
inequality comes from the assumption of i.i.d. $\{X_\tau\}$, which
implies that $X_a$ is independent of $\hat C_b$ and $X_b$ for all $b < a$.
Since at least one $x$ will make $M_{\hat C_{a-1}}(x) = 1$, each term in
the product is then upper bounded by $1 - \min_x\{P_G(X_t = x)\}$.
It is worth noting that by the regularity assumption on $G$,
$1 - \min_x\{P_G(X_t = x)\}$ is strictly less than 1.

Then from (11), (12), and the union bound, we obtain
$P_{C_0}(B_{t+1}) \le P_{C_0}(B^1_{t+1}) + P_{C_0}(B^2_{t+1}) \le 2a^{t - ((\sqrt{t} - 1)^2 + 1)}$
for some $a < 1$. Hence $\sum_{t+1=7}^{\infty} P_{C_0}(B_{t+1}) < \infty$. From (10),
we conclude that

$$\lim_{t\to\infty} E_{C_0}\{T_{\mathrm{inf}}(t)\} \le 6 + \sum_{\tau+1=7}^{\infty} \left(P_{C_0}(A_{\tau+1}) + P_{C_0}(B_{\tau+1})\right) < \infty,$$

which completes the proof. ■

Appendix IV 
Proof of Theorem 5 

Proof: The proof is inspired by [14]. Without loss of
generality, we assume $M_{C_0} = 2$, which immediately implies
$T_{\mathrm{inf}}(t) = T_1(t)$. Fix a $\theta$ with $\mu_\theta > \mu_{\theta_2}$, and define
$C' = (\theta, \theta_2)$. Let $\lambda(n)$ denote the log likelihood ratio between $\theta_1$
and $\theta$ based on the first $n$ observed rewards of arm 1. That is,

$$\lambda(n) := \sum_{m=1}^{n} \log \frac{F_{\theta_1}\left(dY^1_{\tau(m)} \,\middle|\, X_{\tau(m)}\right)}{F_{\theta}\left(dY^1_{\tau(m)} \,\middle|\, X_{\tau(m)}\right)},$$

where $\tau(m)$ is a random variable corresponding to the time
index of the $m$-th pull of arm 1.

By conditioning on the sequence $\{X_{\tau(m)}\}$, $\lambda(n)$ is a sum
of independent r.v.'s. Let $K_{C'} := \sup_{x \in X}\{I(\theta_1, \theta | x)\}$, and
suppose there exists $\delta > 0$ such that

$$\limsup_{n\to\infty} \frac{\lambda(n)}{n} > K_{C'} + \delta,$$

with positive probability. Then with positive probability, there
exists an $x_0$ such that the average of the subsequence for
which $X_{\tau(m)} = x_0$ will be larger than $K_{C'} + \delta$. This,
however, contradicts the strong law of large numbers since the
subsequence is i.i.d. with marginal expectation $I(\theta_1, \theta | x_0)$.
Thus we obtain

$$\limsup_{n\to\infty} \frac{\lambda(n)}{n} \le K_{C'}, \quad P_{C_0}\text{-a.s.} \tag{13}$$

The inequality (13) is equivalent to the statement that with
probability one, there are finitely many $n$ such that
$\lambda(n) > n(K_{C'} + \delta)$ for any $\delta > 0$. And since $K_{C'} > 0$, this
in turn implies there are at most finitely many $n$ such that
$\max_{m \le n} \lambda(m) > n(K_{C'} + \delta)$. As a result, we have

$$\limsup_{n\to\infty} \frac{\max_{m \le n} \lambda(m)}{n} \le K_{C'}, \quad P_{C_0}\text{-a.s.},$$

$$\text{and } \lim_{n\to\infty} P_{C_0}\left(\exists m \le n,\ \lambda(m) > (1 + \delta) n K_{C'}\right) = 0. \tag{14}$$
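Henceforth, we proceed using contradiction. Suppose

$$\limsup_{t\to\infty} P_{C_0}\left(T_1(t) < \frac{\log t}{(1 + 2\delta) K_{C'}}\right) > 0.$$

Using $A_1$ and $A_2$ as shorthand to denote the events
$A_1 := \left\{T_1(t) < \frac{\log t}{(1 + 2\delta) K_{C'}}\right\}$ and
$A_2 := \left\{\lambda(T_1(t)) \le \frac{(1 + \delta) \log t}{1 + 2\delta}\right\}$,
and by (14), we have

$$\limsup_{t\to\infty} P_{C_0}(A_1 \cap A_2) > 0. \tag{15}$$

The quantity $E_{C'}\{T_{\mathrm{inf}}(t)\}$ can be rewritten as follows.

$$\begin{aligned}
E_{C'}\{T_{\mathrm{inf}}(t)\} &\overset{(a)}{=} E_{C'}\{T_2(t)\} \overset{(b)}{=} E_{C'}\{t - T_1(t)\} \\
&\overset{(c)}{\ge} \left(t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) P_{C'}(A_1) \\
&\overset{(d)}{\ge} \left(t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) P_{C'}(A_1 \cap A_2) \\
&\overset{(e)}{\ge} \left(t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) e^{-\frac{(1 + \delta) \log t}{1 + 2\delta}} P_{C_0}(A_1 \cap A_2) \\
&\overset{(f)}{=} \left(t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) t^{-\frac{1 + \delta}{1 + 2\delta}} P_{C_0}(A_1 \cap A_2).
\end{aligned} \tag{16}$$

The equality marked (a) follows from $M_{C'} = 1$ and (b)
follows from the fact that $T_1(t) + T_2(t) = t$. (c) and (d) follow
from elementary probability inequalities, (e) follows from the
change-of-measure formula and the definition of $A_2$, in which
$\lambda(T_1(t)) \le \frac{(1 + \delta) \log t}{1 + 2\delta}$. (f) follows from simple arithmetic and
Eq. (15).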
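
The change-of-measure step can be spelled out as follows. This is a sketch under the assumption that the two configurations differ only in the reward distribution of arm 1, so that the likelihood ratio of the observations up to time $t$ under $C'$ relative to $C_0$ equals $e^{-\lambda(T_1(t))}$.

```latex
\begin{align*}
P_{C'}(A_1 \cap A_2)
  &= E_{C_0}\left\{ 1\{A_1 \cap A_2\}\, e^{-\lambda(T_1(t))} \right\} \\
  &\ge e^{-\frac{(1+\delta)\log t}{1+2\delta}}\, P_{C_0}(A_1 \cap A_2)
   = t^{-\frac{1+\delta}{1+2\delta}}\, P_{C_0}(A_1 \cap A_2).
\end{align*}
```

Since $t \cdot t^{-(1+\delta)/(1+2\delta)} = t^{\delta/(1+2\delta)}$, the resulting lower bound on the expected inferior sampling time under $C'$ grows polynomially in $t$ along any subsequence on which $P_{C_0}(A_1 \cap A_2)$ stays bounded away from zero.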

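The inequality (16) contradicts the assumption that $\{\phi_\tau\}$ is
uniformly good for both $C_0 = (\theta_1, \theta_2)$ and $C' = (\theta, \theta_2)$, and
thus we have

$$\lim_{t\to\infty} P_{C_0}\left(T_1(t) < \frac{(1 - \epsilon) \log t}{K_{C'}}\right) = 0, \quad \forall \epsilon > 0.$$

By choosing the $\theta$ in $C' = (\theta, \theta_2)$ as the minimizing
configuration of $\inf_{\theta > \theta_2} \sup_x I(\theta_1, \theta | x)$, we complete the proof
of the first statement of Theorem 5. The second statement
in Theorem 5 can be obtained by simply applying Markov's
inequality and the first statement. ■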



Appendix V 
PROOF OF Theorem 6 

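We prove Theorem 6 by decomposing the inferior sampling
time instants into disjoint subsequences, each of which is
treated in a separate lemma. For simplicity,
throughout this proof, we use $1\{\text{Cond1}(t)\}$ as shorthand for
$1\{\text{Cond1 is satisfied at time } t\}$,^6 and use $\delta$-nbd$(G)$ to denote
the $\delta$-neighborhood of the distribution $G(x)$ on the $L^\infty$ space
of distributions.

Suppose $M_{C_0} = 2$. To prove that for the $\{\phi_\tau\}$ in Algorithm 3,

$$\limsup_{t\to\infty} \frac{E_{C_0}\{T_1(t)\}}{\log t} \le \frac{1}{\inf_{\theta > \theta_2} \max_x I(\theta_1, \theta | x)},$$

we first note the following:

$$\begin{aligned}
T_1(t) &= \sum_{\tau=1}^{t} 1\{\phi_\tau = 1\} \\
&= \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond0}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond1}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond2}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha < \theta^\beta \ne \theta_2,\ \text{Cond3}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 \ne \theta^\alpha > \theta^\beta,\ \text{Cond3}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 \ne \theta^\alpha < \theta^\beta = \theta_2,\ \text{Cond3}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 = \theta^\alpha > \theta^\beta,\ \text{Cond3}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta) = C_0 = (\theta_1, \theta_2),\ \text{Cond3}(\tau)\}.
\end{aligned} \tag{17}$$

These eight terms on the right-hand side of (17) will be treated
separately in Lemmas 2 through 8.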

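Lemma 2: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$.^7 Then

$$\forall C_0 \in \Theta^2, \quad \lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond0}(\tau)\}\right\} \le \lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\text{Cond0}(\tau)\}\right\} < \infty.$$

Proof: Let $T0 := \sum_{\tau=1}^{\infty} 1\{\text{Cond0}(\tau)\}$. By the monotone
convergence theorem, it is equivalent to prove that
$E_{C_0}\{T0\} < \infty$ for all $C_0$. By the definition of Cond0, we
have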

Pc o (T0 = t) 

< E p G(Xt = x, Vt < t and r = t mod 2, X T 7^ x) 
16X 



< (l-mnP G (I, =x))' 



By directly computing the expectation, we obtain $E_{C_0}\{T0\} < \infty$. ■

6 "At time t" means after observing Xt but before the final decision <f>t is 
made. It is basically the moment when we are performing the <j>t -deciding 
algorithm. 

7 There is no need to consider the case $\theta_1 = \theta_2$, since in that case, all
allocation rules are optimal.
Lemma 3: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond1}(\tau)\}\right\} \le \lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\text{Cond1}(\tau)\}\right\} < \infty.$$

Proof: We define $L_X(t|\text{Cond1})$ as the empirical
distribution of $X_\tau$ at those time instants $\tau \le t$ for which Cond1
is satisfied. We then have

$$\begin{aligned}
\sum_{\tau=1}^{t} 1\{\text{Cond1}(\tau)\} &= \sum_{\tau=1}^{t} 1\{\text{Cond1}(\tau),\ L_X(\tau|\text{Cond1}) \in \delta\text{-nbd}(G)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\text{Cond1}(\tau),\ L_X(\tau|\text{Cond1}) \notin \delta\text{-nbd}(G)\}.
\end{aligned} \tag{18}$$

By Sanov's theorem on finite alphabets (see [24]), each term in
the second sum is exponentially upper bounded w.r.t. $\tau$, which
implies the bounded expectation of the second sum. For the
first sum, we have

i 

J2 l{Cond1(r),L x (r|Cond1) e 5-nbd(G)} 

T = l 

OO 

< st L f(T~l) i e-nbd(F fll (-|x)), 

T = l 

and L x (r|Cond1) £ 5-nbd(G)} 

2 oo 



< E E E > 



L x (r|Cond1) e 5-nbd(G)} 

r'P G (X = a: )(l-5) - 1 



s.t. p(Lf(n),F fll (.|x))> e }. 



(19) 



(20) 



The first inequality comes from extending the finite sum to the
infinite sum and the definition of Cond1. The second inequality
comes from the union bound. The third inequality comes
from the following three steps. First we change the summation
index from the time variable $\tau$ to $\tau'$, which specifies that it is
the $\tau'$-th time that the condition in (19) is satisfied. (Note: by
definition, $\tau \ge \tau'$.) Second, by $L_X(\tau|\text{Cond1}) \in \delta\text{-nbd}(G)$,
there must be at least $\tau' P_G(X = x)(1 - \delta)$ time instants that
$X_s = x$, $s \le \tau$, which guarantees we have enough access to
the bandit machine $x$. And finally, by the definition of Cond1
in Algorithm 3, at the r'-th time of satisfaction, the sample 
size n must be greater than T Pc o^ x ~^ { - 1 6 J. 1 _ By slightly 
abusing the notation $L_i^x(t)$ with $L_i^x(n)$, where $n$ represents the
sample size $T_i^x(t)$ rather than the current time $t$, we obtain the
third inequality.

Remark: this change-of-index transformation will be used
extensively throughout the proofs in this section.

By Sanov's theorem on $\mathbb{R}$ (Theorem 9), the probability of
each term in (20) is exponentially upper bounded w.r.t. $\tau'$,
which implies that the summation has bounded expectation.
By (18), the proof of Lemma 3 is then complete. ■

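Lemma 4: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond2}(\tau)\}\right\} < \infty.$$

Proof: By the assumption $\theta_1 < \theta_2$, we have

$$\begin{aligned}
\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \text{Cond2}(\tau)\} &= \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha = \bar\theta\} \\
&= \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha = \bar\theta,\ L_X(\tau|\theta^\alpha = \bar\theta) \in \delta\text{-nbd}(G)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha = \bar\theta,\ L_X(\tau|\theta^\alpha = \bar\theta) \notin \delta\text{-nbd}(G)\}.
\end{aligned}$$

By Sanov's theorem on finite alphabets, each term in the
second sum is exponentially upper bounded w.r.t. $\tau$, which
implies the bounded expectation of the second sum. For the
first sum, we have

$$\begin{aligned}
&\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha = \bar\theta,\ L_X(\tau|\theta^\alpha = \bar\theta) \in \delta\text{-nbd}(G)\} \\
&\quad \le \sum_{\tau=1}^{\infty} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha = \bar\theta,\ \exists x, \text{ s.t. } \rho(L_1^x(\tau - 1), F_{\theta_1}(\cdot|x)) > \epsilon,\ L_X(\tau|\theta^\alpha = \bar\theta) \in \delta\text{-nbd}(G)\} \\
&\quad \le \sum_{x} \sum_{\tau'=1}^{\infty} 1\{\exists n \ge [\tau' P_G(X = x)(1 - \delta) - 1], \text{ s.t. } \rho(L_1^x(n), F_{\theta_1}(\cdot|x)) > \epsilon\}.
\end{aligned}$$

By extending the finite sum to the infinite sum, we obtain
the first inequality. By the definition of Cond2 in Algorithm 3
and using exactly the same reasoning used in going from (19)
to (20), we obtain the second inequality. By Sanov's theorem,
each term in the above sum is exponentially upper bounded
w.r.t. $\tau'$. Thus it follows that the expectation of the first sum
is also finite, which completes the proof. ■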
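Lemma 5: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\begin{aligned}
&\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha < \theta^\beta \ne \theta_2,\ \text{Cond3}(\tau)\}\right\} \\
&\quad \le \lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha < \theta^\beta \ne \theta_2,\ \text{Cond3}(\tau)\}\right\} < \infty.
\end{aligned}$$

Proof: We have

$$\begin{aligned}
&\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta^\alpha < \theta^\beta \ne \theta_2,\ \text{Cond3}(\tau)\} \\
&\quad = \sum_{(\theta, \vartheta) : \theta < \vartheta \ne \theta_2} \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta, \vartheta),\ \text{Cond3}(\tau)\} \\
&\quad \le 2 \sum_{(\theta, \vartheta) : \theta < \vartheta \ne \theta_2} \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta, \vartheta),\ \text{Cond3a}(\tau)\} \\
&\quad = 2 \sum_{(\theta, \vartheta) : \theta < \vartheta \ne \theta_2} \sum_{x} \sum_{\tau=1}^{t} 1\{X_\tau = x,\ \hat C_{\tau-1} = (\theta, \vartheta),\ \text{Cond3a}(\tau)\} \\
&\quad \le 2 \sum_{(\theta, \vartheta) : \theta < \vartheta \ne \theta_2} \sum_{x} \sum_{\tau=1}^{\infty} 1\{\rho(L_2^x(\tau - 1), F_{\theta_2}(\cdot|x)) > \epsilon,\ \text{Cond3a}(\tau)\} \\
&\quad \le 2 \sum_{(\theta, \vartheta) : \theta < \vartheta \ne \theta_2} \sum_{x} \sum_{\tau'=1}^{\infty} 1\{\exists n \ge [\tau' - 1], \text{ s.t. } \rho(L_2^x(n), F_{\theta_2}(\cdot|x)) > \epsilon\}.
\end{aligned}$$

The first equality follows from conditioning on the event
that the exact value of the estimate $\hat C_{\tau-1}$ is some configuration
pair $(\theta, \vartheta)$. The first inequality follows from the definition of
Cond3a in Algorithm 3, where double the number of time
instants with odd $\mathrm{ctr}(\hat C_t)$ will be larger than the total number
of times that Cond3 is satisfied. The second equality follows
from conditioning on the value of $X_\tau$. The second inequality
follows from the condition that the second coordinate of the
estimate satisfies $\vartheta \ne \theta_2$, and then extending the finite sum to the
infinite sum. The third inequality follows from the definition
of Cond3a and changing the time index to $\tau'$, similarly to
the reasoning in (19)-(20). By Sanov's theorem, each term is
exponentially upper bounded w.r.t. $\tau'$, and thus the entire sum
has bounded expectation. The proof is thus complete. ■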

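Corollary 1: By the symmetry of $\{\phi_\tau\}$, we have

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 \ne \theta^\alpha > \theta^\beta,\ \text{Cond3}(\tau)\}\right\} < \infty.$$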

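Lemma 6: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 \ne \theta^\alpha < \theta^\beta = \theta_2,\ \text{Cond3}(\tau)\}\right\} < \infty.$$

Proof: We have

$$\begin{aligned}
&\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 \ne \theta^\alpha < \theta^\beta = \theta_2,\ \text{Cond3}(\tau)\} \\
&\quad = \sum_{\theta : \theta_1 \ne \theta < \theta_2} \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta, \theta_2),\ \text{Cond3}(\tau)\} \\
&\quad = \sum_{\theta : \theta_1 \ne \theta < \theta_2} \sum_{\tau=1}^{t} \Big(1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta, \theta_2),\ \text{Cond3b1a1}(\tau)\} \\
&\qquad\qquad + 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta, \theta_2),\ \text{Cond3b1a2}(\tau)\}\Big) \\
&\quad \le \sum_{\theta : \theta_1 \ne \theta < \theta_2} \sum_{\theta' : \theta' > \theta_2} \Big(\sum_{\tau=1}^{\infty} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta, \theta_2),\ \theta^* = \theta',\ \text{Cond3b1a1}(\tau)\} \\
&\qquad\qquad + \sum_{\tau=1}^{\infty} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta, \theta_2),\ \theta^* = \theta',\ \text{Cond3b1a2}(\tau)\}\Big) \qquad (21) \\
&\quad \le \sum_{\theta : \theta_1 \ne \theta < \theta_2} \sum_{\theta' : \theta' > \theta_2} \sum_{\tau'=1}^{\infty} \left(4(\tau' + 1)^2 - (2\tau')^2\right) \cdot 1\{\exists n \ge [\tau' - 1], \text{ s.t. } \\
&\qquad\qquad \rho(L_1^{x^*(\theta')}(n), F_{\theta_1}(\cdot|x^*(\theta'))) > \epsilon\}. \qquad (22)
\end{aligned}$$

The second equality comes from the fact that the scheme
samples the inferior arm only when either Cond3b1a1 or
Cond3b1a2 is satisfied. For the first inequality, we condition
on $\theta^*$ and extend to the infinite sum. For the last inequality,
we change the time index to $\tau'$, which specifies the $\tau'$-th
satisfaction of Cond3b1a1, so that we can upper bound the
first sum of (21). The multiplication factor
$\left(4(\tau' + 1)^2 - (2\tau')^2\right)$ in front of the indicator function is
there in order to upper bound the second sum of (21), concerning
Cond3b1a2, simultaneously.

To obtain this result, we note that between the consecutive
times $\tau'$ and $\tau' + 1$ at which Cond3b1a1 is satisfied and arm 1
is pulled, the number of times that Cond3b1a2 is satisfied and
arm 1 is pulled cannot exceed $(2(\tau' + 1))^2 - (2\tau')^2 - 1$, which
is because of the algorithm involving $\mathrm{ctr}(\hat C_t, \theta^*)$ in Line 16.
Multiplying by the factor $\left(4(\tau' + 1)^2 - (2\tau')^2\right)$, we simultaneously
bound these two sums.

By Sanov's theorem, the expectation of the indicator in (22)
is exponentially upper bounded w.r.t. $\tau'$. As a result, the entire
sum will have bounded expectation, which in turn completes
the proof. ■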

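Lemma 7: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta),\ \theta_1 = \theta^\alpha > \theta^\beta,\ \text{Cond3}(\tau)\}\right\} < \infty.$$

Proof: We have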

i 

Y l{cf>T = l,CV-i = (e a ,d ),9 1 =9 a > ^,Cond3(r)} 

r=l 

i 

= E E 1 Wr = l,C T _i = (ei, 1 ?),Cond3(r)} 

#:«1>i9 r=l 
i 

< Y E H^r-l =(«!,*), Cond3(T)} 
i9:S 1 >i9 r=l 

< Y ( 2 E H^r-i = (ei,*),Cond3b(r)} + l] 

#:9l>i9 V r=l / 
= #{tfe©:tf <0i}+ ^ 

l9:01>l? 

[2 Y E 1 {C'r-i = (ei,i9),r=6t',Cond3b(r)}] 
= £ : < 0i} 

+ 2 E EE 

#:01>l9 e':e'>9i T=l 

(l{CV-i = (6i,#),B* =0', 

Cond3b(r),L x (r|Cond3b) e 5-nbd(G)} 
+ l{CV_i = (6>i,i?), 6>* =e', 

Cond3b(r), L x (r|Cond3b) ^ <J-nbd(G)}). (23) 

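The first inequality follows from Line 11 in Algorithm 3,
where Cond3b is satisfied once for every two times Cond3 is
satisfied. The last two equalities come from conditioning on $\theta^*$
and $L_X(\tau|\text{Cond3b})$. By Sanov's theorem on finite alphabets,
the terms of the second sum in (23) are exponentially upper
bounded and the entire sum thus has bounded expectation. For
the first sum, we have

$$\begin{aligned}
&\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b}(\tau),\ L_X(\tau|\text{Cond3b}) \in \delta\text{-nbd}(G)\} \\
&\quad \le \frac{1}{P_G(X = x^*(\theta_1, \vartheta))(1 - \delta)} \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1}(\tau)\}.
\end{aligned} \tag{24}$$

This inequality follows from the fact that once $L_X$ falls into
the $\delta$-nbd$(G)$, the total number of time instants can be upper
bounded by the number of instants when $X_\tau = x^*(\theta_1, \vartheta)$,
divided by $P_G(X = x^*(\theta_1, \vartheta))(1 - \delta)$. To show

$$\lim_{t\to\infty} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1}(\tau)\}\right\} < \infty,$$

we further decompose the expectand into

$$\begin{aligned}
\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1}(\tau)\} &= \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1a}(\tau)\} \\
&\quad + \sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1b}(\tau)\}.
\end{aligned} \tag{25}$$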



follows by replacing the minimum achieving 9' with 9 2 - The 
third inequality follows from expressing A r using its defini- 
tion. The fourth inequality follows from the set relationship, 
where n is T% ^ 9i '^\t), the number of time instants that the 
side information X s = x*(9i 1 d) and cf) s = 2, for s < r. 

We first note that ]J m=1 ^(dy^^n^)) 18 a P osltlve 
martingale with expectation 1, when being considered under 
distribution Fg 2 (-\x* (9i,i))). By Doob's maximal inequality, 
we have 

/ ™ F^dY 2 |z*(0i,tf)) \ ! 

Pc„ 3n < r, - 7 ^ — — > r(logr) 2 ] < 



and thus the expectation is bounded, i.e., 



-r(log t) 2 



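For the first sum in (25), under the assumption $\theta_1 > \vartheta$, we
can write

$$\begin{aligned}
&\sum_{\tau=1}^{t} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1a}(\tau)\} \\
&\quad \le \sum_{\tau=1}^{\infty} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1a1}(\tau)\} \\
&\qquad + \sum_{\tau=1}^{\infty} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1a2}(\tau)\} \\
&\quad \le 1 + 2 \sum_{\tau=1}^{\infty} 1\{\hat C_{\tau-1} = (\theta_1, \vartheta),\ \theta^* = \theta',\ \text{Cond3b1a2}(\tau)\} \\
&\quad \le 1 + 2 \sum_{\tau'=1}^{\infty} 1\{\exists n \ge [\tau' - 1], \text{ s.t. } \rho(L_2^{x^*(\theta_1, \vartheta)}(n), F_{\theta_2}(\cdot|x^*(\theta_1, \vartheta))) > \epsilon\}.
\end{aligned} \tag{26}$$

The first inequality comes from conditioning on the
sub-conditions Cond3b1a1 and Cond3b1a2, and extending to
the infinite sums. Let $SQ_n$ denote the set of perfect squares
in $\{1, \cdots, n\}$. The second inequality is from the
definition of Cond3b1a1 in Algorithm 3 and the fact that
$\forall n \in \mathbb{N}$, $|SQ_n|$ is no larger than $1 + |\{1, \cdots, n\} \setminus SQ_n|$. The
third inequality comes from the fact that by definition, under
Cond3b1a2, $\phi_\tau = 2$, and changing the time index to $\tau'$, the
number of satisfaction times. By Sanov's theorem on $\mathbb{R}$, the
above has bounded expectation.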
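
The counting fact used for the second inequality, namely that the perfect squares in $\{1, \cdots, n\}$ never outnumber the non-squares by more than one, is easy to check directly. The following is a small illustrative sketch; the function names are not from the paper.

```python
from math import isqrt


def count_squares(n):
    """Number of perfect squares in {1, ..., n}."""
    return isqrt(n)


def squares_bound_holds(n_max):
    """Check |SQ_n| <= 1 + |{1..n} minus SQ_n| for every n up to n_max."""
    return all(count_squares(n) <= 1 + (n - count_squares(n))
               for n in range(1, n_max + 1))
```

The bound is tight only at $n = 1$, where the single element is itself a square; for larger $n$ the squares become increasingly sparse, which is why the forced exploration on perfect-square counter values costs so little.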

For the second sum of (25), with the condition 9\ > $ 

t 

l{CV-i = = 6>',Cond3b1b(r)} 

T = l 

t-1 

< 1+ H&r = (0l,<?),0* = e',A T (C T ,e') > r(logr) 2 } 

T = l 

t-1 

< 1+ £ 1{C T = (0i,tf),A T (CV,0 2 ) > r(logr) 2 } 

T = l 

r?) 

<i+eh n 5 L;r ( 'LI >-a°^)M 



T = l 
t-1 



^ F, 2 (dy 2 „ (m) |**(0i,0)) 



^ A F,(dY 2 .^"(ei,,?)) 2 

- 1+ £ 1{3 "^< st Jl.^C^ra >r(IogT) } ' 

where Y 2 , » is the reward of arm 2 at the m-th time that 
X s = x*(6i,'&) and cf) s = 2. The first inequality follows from 
focusing only on the A T _i(Cy-i> 9') condition in Cond3b1 b 
and then shifting the time index r. The second inequality 



-C 



l{C T -i = (6>i,i9), (T = 0',Cond3b1b(r)} 



oo 1 



< OO. 



(27) 



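By (23), (24), (25), (26), and (27), Lemma 7 is proved. ■

Lemma 8: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then

$$\limsup_{t\to\infty} \frac{1}{\log t} E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = (\theta^\alpha, \theta^\beta) = (\theta_1, \theta_2),\ \text{Cond3}(\tau)\}\right\} \le \frac{1}{\inf_{\theta > \theta_2} \max_x I(\theta_1, \theta | x)}.$$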

Proof: By the definition of {<f) T }, especially of 
Cond3b1 a, we have 



Co = (0i,0 2 ),Cond3(T)} 



E Co E ! K = 1 ' (5 - 



< Ec |E HCt-i = C ,Cond3b1a(r)}| 

< E Cq {sup{l < n < t - 1 : 



< Ec {sup{l < n < 00 : 

g F. l( ^|..|,-(Co)) a \ 



— Ec D •! max sup{l < n < 00 : 
[ Q>(>2 



iii F e (dY^ |a ,»|x*(c )) - v ; 



li F e (dY^ \x*(C )) 



where $Y^1_{\tau_{x^*}(m)}$ denotes the reward of the $m$-th time that arm 1
of the sub-bandit machine $X_\tau = x^*(C_0)$ is pulled. The
first inequality follows because, by definition, only when
Cond3b1a is satisfied can $\phi_\tau = 1$, given $\hat C_{\tau-1} = C_0$.
The second inequality is obtained by focusing on the
sub-condition $\Lambda_\tau(\hat C_\tau, \theta)$ in Cond3b1a, and letting $n = T_1^{x^*}(t - 1)$
be the number of time instants when arm 1 is pulled and
$X_\tau = x^*(C_0)$. The third inequality comes from extending
the upper bound of $n$ from $t - 1$ to $\infty$. The equalities come
from rearranging the max and min operators and elementary
implications. By applying Lemma 4.3 of [12], quoted as
Lemma 9 below, we have

$$\begin{aligned}
\limsup_{t\to\infty} \frac{E_{C_0}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ \hat C_{\tau-1} = C_0,\ \text{Cond3}(\tau)\}\right\}}{\log t + 2 \log\log t}
&\le \frac{1}{\min_{\theta > \theta_2} E_{C_0}\left\{\log \frac{F_{\theta_1}\left(dY^1_{\tau_{x^*}(1)} \,\middle|\, x^*(C_0)\right)}{F_{\theta}\left(dY^1_{\tau_{x^*}(1)} \,\middle|\, x^*(C_0)\right)}\right\}} \\
&= \frac{1}{\min_{\theta > \theta_2} I(\theta_1, \theta | x^*(C_0))} \\
&= \frac{1}{\max_x \inf_{\theta > \theta_2} I(\theta_1, \theta | x)} \\
&= \frac{1}{\inf_{\theta > \theta_2} \max_x I(\theta_1, \theta | x)},
\end{aligned}$$

where the equalities come from the existence-of-saddle-points
assumption. By noting that $\log t \gg 2 \log\log t$, this completes
the proof of Lemma 8. ■

By (17) and Lemmas 2 through 8, it has been proved that
for the $\{\phi_\tau\}$ described in Algorithm 3,

$$\limsup_{t\to\infty} \frac{E_{C_0}\{T_{\mathrm{inf}}(t)\}}{\log t} \le \frac{1}{K_{C_0}}, \quad \forall C_0 \in \Theta^2.$$

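Lemma 4.3 of [12] is quoted as follows.

Lemma 9 (Lemma 4.3 of [12]): Suppose $Y_1, Y_2, \cdots$ are
i.i.d. r.v.'s taking values in a finite set $\mathcal{Y}$, with marginal
mass function $p(y)$. Let $f^\theta : \mathcal{Y} \to \mathbb{R}$ be such that
$0 < E_p\{f^\theta(Y_1)\} < \infty$, $\forall \theta \in \Theta$, where $\Theta$ is a finite set. Define
$L_A^\theta := \sum_{t=1}^{\infty} 1\left\{\sum_{i=1}^{t} f^\theta(Y_i) \le A\right\}$, and
$L_A := \max_{\theta \in \Theta} L_A^\theta$. Then

$$\limsup_{A\to\infty} \frac{E_p\{L_A\}}{A} \le \frac{1}{\min_{\theta \in \Theta} E_p\{f^\theta(Y_1)\}}. \tag{28}$$

Note: by incorporating Cramer's theorem during the proof
of this lemma in [12], it can be extended to continuous r.v.'s
$Y_1, Y_2, \cdots$, provided $E_p\{|f^\theta(Y_1)|\}$ and $E_p\{|f^\theta(Y_1)|^2\}$ are
finite for all $\theta$.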

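Appendix VI
PROOF OF Theorems 7 AND 8

Proof of Theorem 7 ($\log t$ lower bound): This proof is
basically a variation of that for Theorem 5, with the major
difference being that the competing configuration $C' = (\theta, \theta_2)$
is now from a different set: $\{\theta : \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)\}$. We
can first follow line by line the proof of Theorem 5, and
replace (16) with the following inequality.

$$\begin{aligned}
E_{C'}\{T_{\mathrm{inf}}(t)\} &\ge E_{C'}\left\{\sum_{\tau=1}^{t} 1\{\phi_\tau = 2,\ M_{C'}(X_\tau) = 1\}\right\} \\
&= E_{C'}\left\{\sum_{\tau=1}^{t} 1\{M_{C'}(X_\tau) = 1\} - \sum_{\tau=1}^{t} 1\{\phi_\tau = 1,\ M_{C'}(X_\tau) = 1\}\right\} \\
&\ge E_{C'}\left\{\sum_{\tau=1}^{t} 1\{M_{C'}(X_\tau) = 1\} - \sum_{\tau=1}^{t} 1\{\phi_\tau = 1\}\right\} \\
&\overset{(b)}{=} E_{C'}\{\pi t - T_1(t)\} \\
&\overset{(c)}{\ge} \left(\pi t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) P_{C'}(A_1) \\
&\overset{(d)}{\ge} \left(\pi t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) P_{C'}(A_1 \cap A_2) \\
&\overset{(e)}{\ge} \left(\pi t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) e^{-\frac{(1 + \delta) \log t}{1 + 2\delta}} P_{C_0}(A_1 \cap A_2) \\
&\overset{(f)}{=} \left(\pi t - \frac{\log t}{(1 + 2\delta) K_{C'}}\right) t^{-\frac{1 + \delta}{1 + 2\delta}} P_{C_0}(A_1 \cap A_2),
\end{aligned}$$

where the first inequality comes from dropping the other half
of the events, where $\{\phi_\tau = 1,\ M_{C'}(X_\tau) = 2\}$. The second
inequality comes from dropping the condition $M_{C'}(X_\tau) = 1$.
With $\pi := P_G(M_{C'}(X_\tau) = 1) > 0$, recalling that $\theta$ satisfies
that $\exists x_0$ such that $M_{C'}(x_0) = 1$, we obtain (b). (c)-(f)
follow from the same reasoning as discussed in connection
with (16). From the contradiction of the uniformly-good-rule
assumption, we have

$$\lim_{t\to\infty} P_{C_0}\left(T_1(t) < \frac{(1 - \epsilon) \log t}{K_{C'}}\right) = 0, \quad \forall \epsilon > 0.$$

By choosing the $\theta$ in $C' = (\theta, \theta_2)$ as the minimizing
configuration of $\inf_{\{\theta : \exists x \text{ s.t. } \mu_\theta(x) > \mu_{\theta_2}(x)\}} \sup_x I(\theta_1, \theta | x)$, the proof
of the first statement in Theorem 7 follows. The second
statement in Theorem 7 can be obtained by simply applying
Markov's inequality and the first statement. ■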
Proof of Theorem 8 (bound-achieving scheme): Follow- 
ing the same path as in the proof of Theorem 6, we first 
decompose the inferior sampling time instants into disjoint 
subsequences, each of which will be discussed separately. 

Ti„f(t) 



]Tl{<^M Co (X T )} 

r=l 
t 

H4>t ^M Co (X T ),CondO(r)} 

T = l 

t 

+ £ 1{4> T ^ M Ca (X T ), Condi (t)} 

T = l 

t 

+ E 7^ M Co (X T ),Cond2(r)} 

T = l 
t 

+ E ^ ^ M Co (X T ),Cond2.5(r)} 



£ i{<p T + m Co (x t ), cv_i = (e a ,e^), 

T = l 

e a -< 6C 3 + e 2 ,Cond3(r)} 

t 

1{4>t ^ M Ca (X T ), C T _i = (e«,^), 

T = l 

0! ^ 6» a >- 9 13 , Cond3(r)} 

t 

- E 1{<)!. T ^ M Co (X T ),6r-l = (6 a ,el } ), 
r=l 

6»! ^ 6» a -< e' 3 = 6» 2 ,Cond3(r)} 

t 

^ 1{0 T ^ M Co (X T ),C T _i = (ff-.O. 

r=l 

e 1 = e a ^e 13 ^ 6» 2 ,Cond3(r)} 

t 

^ i{0 T ^ m Co (x t ),c t -i = (e",^) = Co = (01,02), 



Cond3(r)}. 



(29) 



By exactly the same analysis as in Lemmas 2 and 3, 
the first two sums in (29), concerning Cond0 and Cond1, 
have bounded expectations. Let $\theta \prec \vartheta$ denote the configuration 
satisfying $\forall x_0$, $\mu_\theta(x_0) < \mu_\vartheta(x_0)$. For the sum concerning 
Cond2, $\{\phi_\tau \neq M_{C_0}(X_\tau), \mathrm{Cond2}(\tau)\}$ implies it is either 
$\theta^\alpha = \theta^\beta \neq \theta_1$ or $\theta^\alpha = \theta^\beta \neq \theta_2$, where $(\theta^\alpha, \theta^\beta) = \hat{C}_{\tau-1}$. 
Both of the above cases are discussed in Lemma 4 and are 
proved to have finite expectations. 
For future reference, we denote the five different sums 
concerning Cond3 as term3a, term3b, term3c, term3d, and 
term3e, in order. By Lemma 5 and Corollary 1, both term3a 
and term3b have bounded expectations. 
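
The bookkeeping behind (29) can be sketched in a few lines: because the conditions are mutually exclusive, the inferior sampling time equals the sum of the per-condition counts. The function name, the toy trajectory, and the string labels below are hypothetical and are not taken from the paper or from Algorithm 4.

```python
from collections import Counter

def inferior_time_by_condition(pulls, best_arms, conditions):
    """Count inferior pulls overall and per (mutually exclusive) condition label."""
    per_condition = Counter()
    total = 0
    for arm, best, cond in zip(pulls, best_arms, conditions):
        if arm != best:  # indicator 1{phi_tau != M_{C_0}(X_tau)}
            total += 1
            per_condition[cond] += 1
    return total, dict(per_condition)

# Hypothetical toy trajectory (illustrative only, not data from the paper).
pulls      = [1, 2, 2, 1, 2, 1, 1, 2]
best_arms  = [1, 1, 2, 2, 1, 1, 2, 2]
conditions = ["Cond0", "Cond1", "Cond3", "Cond2",
              "Cond3", "Cond3", "Cond2.5", "Cond3"]

t_inf, parts = inferior_time_by_condition(pulls, best_arms, conditions)
print(t_inf, parts)
```

Bounding the expectation of each per-condition count separately, as the proof does, then bounds the expectation of the total.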

If the underlying $C_0$ is not implicitly revealing, by Lemmas 6 
and 7, term3c and term3d have bounded expectations. 

And by Lemma 8, $\limsup_{t\to\infty} \frac{E_{C_0}\{\text{term3e}\}}{\log t} \le \frac{1}{K_{C_0}}$. 

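If the underlying $C_0$ is implicitly revealing, term3e $= 0$. 
For term3c and term3d, we have 

$$
\begin{aligned}
& \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau \neq M_{C_0}(X_\tau), \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 \neq \theta^\alpha \prec \theta^\beta = \theta_2, \mathrm{Cond3}(\tau)\} \\
&+ \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau \neq M_{C_0}(X_\tau), \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 = \theta^\alpha \succ \theta^\beta \neq \theta_2, \mathrm{Cond3}(\tau)\} \\
\le{}& \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau = 1, \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 \neq \theta^\alpha \prec \theta^\beta = \theta_2, \mathrm{Cond3}(\tau)\} \\
&+ \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau = 2, \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 \neq \theta^\alpha \prec \theta^\beta = \theta_2, \mathrm{Cond3}(\tau)\} \\
&+ \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau = 1, \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 = \theta^\alpha \succ \theta^\beta \neq \theta_2, \mathrm{Cond3}(\tau)\} \\
&+ \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau = 2, \hat{C}_{\tau-1} = (\theta^\alpha, \theta^\beta), \theta_1 = \theta^\alpha \succ \theta^\beta \neq \theta_2, \mathrm{Cond3}(\tau)\},
\end{aligned}
\tag{30}
$$

which is obtained by replacing the condition $\phi_\tau \neq M_{C_0}(X_\tau)$ 
with either $\phi_\tau = 1$ or $\phi_\tau = 2$. By Lemma 6, both the first 
and the fourth sums in (30) have bounded expectations. By 
Lemma 7, both the second and the third sums in (30) also 
have bounded expectations. 

Note: in the proofs of Lemmas 6, 7, and 8, there are 
summations or minima taken on the set $\{\theta \succ \theta_2\}$. All those 
sets could be replaced by $\{\theta : \exists x_0 \text{ s.t. } \mu_\theta(x_0) > \mu_{\theta_2}(x_0)\}$ 
and the rest of the proofs still follow. 

We have discussed all sub-sums in (29) except the sum regarding 
Cond2.5. It remains to show that the sum concerning 
Cond2.5 has bounded expectation, which is addressed in the 
following lemma. 

Lemma 10: Consider the $\{\phi_\tau\}$ described in Algorithm 4. 
For all possible $C_0$, we have 

$$\lim_{t\to\infty} E_{C_0}\left\{ \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau \neq M_{C_0}(X_\tau), \mathrm{Cond2.5}(\tau)\} \right\} < \infty.$$

Proof: 

$$
\begin{aligned}
& \sum_{\tau=1}^{t} \mathbf{1}\{\phi_\tau \neq M_{C_0}(X_\tau), \mathrm{Cond2.5}(\tau)\} \\
\le{}& \sum_{(\theta,\vartheta):(\theta,\vartheta) \neq C_0} \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \mathrm{Cond2.5}(\tau)\} \\
={}& \sum_{(\theta,\vartheta):(\theta,\vartheta) \neq C_0} \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\} \\
&+ \sum_{(\theta,\vartheta):(\theta,\vartheta) \neq C_0} \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \notin \delta\text{-nbd}(G)\}.
\end{aligned}
\tag{31}
$$

By Sanov's theorem on finite alphabets, each term in the 
second sum is exponentially upper bounded w.r.t. $\tau$, which 
implies that the second sum has finite expectation. For the 
first sum, we have 

$$
\begin{aligned}
& \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\} \\
\le{}& \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \theta \neq \theta_1, \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\} \\
&+ \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \vartheta \neq \theta_2, \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\},
\end{aligned}
\tag{32}
$$

which is obtained by considering whether $\theta \neq \theta_1$ or $\vartheta \neq \theta_2$, 
recalling that $(\theta,\vartheta) \neq C_0$. Since these two sums are 
symmetric, henceforth we show only the finite expectation of 
the first sum in (32). The finite expectation of the second sum 
then follows by symmetry. 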

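$$
\begin{aligned}
& \sum_{\tau=1}^{t} \mathbf{1}\{\hat{C}_{\tau-1} = (\theta,\vartheta), \theta \neq \theta_1, \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\} \\
\le{}& \sum_{\tau=1}^{t} \mathbf{1}\{\exists x \text{ s.t. } M_{(\theta,\vartheta)}(x) = 1, \rho(L_1^x(\tau-1), F_{\theta_1}(\cdot \mid x)) > \epsilon, \\
& \qquad\quad \mathrm{Cond2.5}(\tau), L_X(\tau \mid \mathrm{Cond2.5}) \in \delta\text{-nbd}(G)\} \\
\le{}& \sum_{x : M_{(\theta,\vartheta)}(x) = 1} \sum_{\tau'=1}^{\infty} \mathbf{1}\{\exists n \ge [\tau' P_G(X = x)(1-\delta)] \text{ s.t. } \rho(L_1^x(n), F_{\theta_1}(\cdot \mid x)) > \epsilon\}.
\end{aligned}
\tag{33}
$$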



The first inequality comes from the definition of Cond2.5: 
since $\hat{C}_{\tau-1} = (\theta, \vartheta)$ is implicitly revealing, there must be 
an $x$ s.t. $M_{(\theta,\vartheta)}(x) = 1$. And since the estimate $\theta \neq \theta_1$, for 
that specific $x$, the distance between $L_1^x$ and $F_{\theta_1}(\cdot \mid x)$ must be 
greater than $\epsilon$. The second inequality comes from changing 
the time index to $\tau'$, the time instants at which $X_\tau = x$ and 
Cond2.5 is satisfied, and extending the summation to infinity. 
(This change of the time index is similar to the one described 
in (19)-(20).) 
Thus by Sanov's theorem on $\mathbb{R}$, the expectation of each term 
in (33) is exponentially upper bounded w.r.t. $\tau'$, which implies 
finite expectation of the entire sum in (33). By the discussions 
on (31), (32), and (33), Lemma 10 is proved. ■ 
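
The step from "exponentially upper bounded terms" to "finite expectation" can be illustrated numerically. The sketch below is not the paper's argument: it swaps in a Bernoulli source and uses Hoeffding's inequality as a simple stand-in for the Sanov exponent, and the function names and the parameters `p`, `delta`, and `horizon` are assumptions chosen for illustration.

```python
import math

def deviation_prob(n, p, delta):
    """Exact P(|k/n - p| > delta) for k ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        if abs(k / n - p) > delta:
            total += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return total

def hoeffding_bound(n, delta):
    """Hoeffding's exponential bound 2*exp(-2*n*delta^2) on the same probability."""
    return 2.0 * math.exp(-2.0 * n * delta**2)

p, delta, horizon = 0.3, 0.2, 200  # illustrative values only

# Probability that the empirical frequency after n samples is delta-far from p.
probs = [deviation_prob(n, p, delta) for n in range(1, horizon + 1)]

# Expected number of "bad" time instants up to the horizon.
partial_sum = sum(probs)

# Geometric-series limit of the Hoeffding bounds summed over all n >= 1.
ratio = math.exp(-2.0 * delta**2)
geometric_limit = 2.0 * ratio / (1.0 - ratio)

print(partial_sum, geometric_limit)
```

Each deviation probability is dominated by a geometrically decaying term, so the expected count of bad instants stays below a constant that does not depend on the horizon, which is the shape of the argument applied to (31) and (33).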

From the above discussion of the sub-sums in (29), we 
conclude that the modified scheme, $\{\phi_\tau\}$ in Algorithm 4, 
has bounded $E_{C_0}\{T_{\mathrm{inf}}(t)\}$ if the underlying $C_0$ is implicitly 
revealing. If $C_0$ is not implicitly revealing, the $\{\phi_\tau\}$ in 
Algorithm 4 achieves the new $\log t$ lower bound (4). ■ 